Analysis of Childcare Costs in Relation to Family Income and Employment Data¶

Prepared by: Matthew Walker, Neysha Pagán Vargas

Date: December 15, 2024

INTRODUCTION¶

Childcare costs have become a critical concern for families across various income levels, impacting financial stability and employment decisions. This project, Analysis of Childcare Costs in Relation to Family Income and Employment Data, aims to explore the intricate relationships between childcare expenses, family income, and employment dynamics. By developing a robust data engineering pipeline, the project seeks to provide actionable insights into how childcare affordability varies across demographics, geographic regions, and employment statuses.

The primary goal of this project is to assess how childcare costs influence financial stability and employment opportunities, especially for women in the workforce. Using a combination of static datasets and API-sourced data, the project will build a scalable and efficient data processing system that ensures data accuracy and accessibility. The analysis will focus on key metrics, including childcare affordability as a percentage of family income, regional disparities, and correlations between employment rates and childcare burdens.

The project explores multiple database approaches to ensure optimal data storage, retrieval, and analysis. While MongoDB is utilized for its flexibility and scalability in handling semi-structured data, Cassandra is considered for its high availability and write performance across distributed environments, making it suitable for large-scale, geographically dispersed data. Additionally, Elasticsearch is included as an alternative for its full-text search capabilities and fast querying, enabling advanced data exploration and trend analysis.

This notebook will guide users through the development and implementation of the data pipeline, utilizing tools like PySpark for data transformation, MinIO for object storage, and various database technologies (MongoDB, Cassandra, and Elasticsearch) for data querying and analysis. Visualizations generated in Jupyter Notebooks and Kibana will provide a clear depiction of trends and correlations, enabling stakeholders to understand the challenges and opportunities in childcare affordability.

By combining data engineering best practices with detailed analysis and a diversified database approach, this project aims to uncover meaningful insights and inform policy recommendations for making childcare more accessible and equitable.

PROJECT OVERVIEW¶

This project aims to develop a data engineering pipeline to ingest, process, and analyze data on childcare costs alongside employment and family income data. The objective is to assess how childcare costs impact family financial stability, focusing on factors such as employment rates, income distribution, and affordability of childcare across different states and counties. This proposal will outline the key datasets, potential analyses, and the data engineering methodologies involved in constructing an effective data pipeline.

Dataset Descriptions¶

All data is from 2018¶

  1. Childcare Prices as a Share of Median Family Income

    • Source: Department of Labor (DOL)
    • URL: Median Family Income by Age and Care Setting
    • Description:
      • Provides data on childcare costs for different care settings (center-based and home-based) and age groups (infant, toddler, preschool, and school-age).
      • Includes county-level information on childcare prices as a share of median family income.
    • Key variables:
      • Care settings (e.g., center-based, home-based)
      • Age groups
      • Median family income
      • Percentage of income spent on childcare
  2. Labor Force Statistics: Employment and Earnings

    • Source: Bureau of Labor Statistics (BLS)
    • URL: CPS Annual Averages for 2018
    • Dataset Name: 8. Employed and unemployed full- and part-time workers by age, sex, race, and Hispanic or Latino ethnicity
    • Description:
      • Provides detailed labor force statistics, focusing on employment rates and earnings.
      • Worksheets used:
        • CPSAAT08: Employed and unemployed full- and part-time workers by age, sex, race, and Hispanic or Latino ethnicity.
    • Key variables:
      • Employment status (full-time, part-time)
      • Median wages
      • Demographic breakdowns (age, gender, and occupation)
  3. Median Family Income Data

    • Source: National Center for Education Statistics (NCES)
    • URL: Median Income by State
    • Description:
      • Contains state-level averages for median family income.
      • Similar to the DOL dataset but aggregated at the state level, losing granularity at the county level.
      • Potential redundancy, as it mirrors data from the DOL.

Techniques and Methodologies¶

This project prioritizes the development and implementation of robust data engineering practices to handle and process large-scale datasets related to childcare costs, family income, and employment statistics. The emphasis is on building a scalable and efficient pipeline that ensures data accuracy, reliability, and readiness for analysis. While some visualizations will be included to summarize results, the primary focus is to demonstrate the end-to-end data engineering process. Below is a detailed outline of the techniques and methodologies employed:

1. Data Ingestion¶

  • Sources:
    • CSV and Excel files from Department of Labor (DOL), US Census Bureau (USCB), and National Center for Education Statistics (NCES).
    • Worksheet (CPSAAT08) from the Bureau of Labor Statistics (BLS).
  • Tools:
    • MinIO Object Storage:
      • Used to store raw files in an organized manner.
      • Files are partitioned by source and year for efficient retrieval.
    • PySpark:
      • Reads and ingests datasets into Spark DataFrames for distributed processing.

2. Data Preprocessing¶

  • Handling Missing and Inconsistent Data:
    • Impute missing values where applicable.
    • Remove or standardize invalid entries.
  • Standardization:
    • Ensure consistent formats for numeric fields (e.g., median income, childcare costs).
    • Standardize categorical fields (e.g., state names, care settings).
  • Schema Validation:
    • Use schema enforcement during data ingestion to validate data types and structures.
    • Reject invalid rows and log issues for auditability.
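
The validation step above can be sketched at the pandas level (the notebook itself relies on Spark's reader options); the column names and values here are hypothetical stand-ins, not the dataset's actual schema:

```python
import pandas as pd

# Hypothetical expected schema: column name -> pandas dtype the pipeline requires.
EXPECTED = {
    "state_name": "object",
    "county_name": "object",
    "median_family_income_dollar": "float64",
}

def validate_schema(df, expected):
    """Return a list of human-readable schema violations (empty list == valid)."""
    issues = []
    for col_name, dtype in expected.items():
        if col_name not in df.columns:
            issues.append(f"missing column: {col_name}")
        elif str(df[col_name].dtype) != dtype:
            issues.append(f"{col_name}: expected {dtype}, got {df[col_name].dtype}")
    return issues

good = pd.DataFrame({
    "state_name": ["Alabama"],
    "county_name": ["Autauga County"],
    "median_family_income_dollar": [58343.0],
})
bad = good.rename(columns={"county_name": "county"})

print(validate_schema(good, EXPECTED))  # valid -> []
print(validate_schema(bad, EXPECTED))
```

Rows failing validation would be rejected and the returned messages logged for auditability, mirroring the policy described above.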

3. Data Transformation¶

  • ETL Pipeline:
    • Extract:
      • Load raw datasets from MinIO into PySpark DataFrames.
    • Transform:
      • Join datasets on common fields (e.g., state, county, year).
      • Compute derived metrics such as:
        • Childcare costs as a percentage of median family income.
        • Employment rates correlated with childcare affordability.
      • Aggregate data to state and county levels for comparison.
    • Load:
      • Save the cleaned and transformed data into partitioned collections in MongoDB, Cassandra, and Elasticsearch.
  • Feature Engineering:
    • Create additional columns for demographic aggregations and geographic comparisons.
    • Calculate trends over time to support historical analysis.
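
The central derived metric, childcare price as a share of median family income, can be illustrated on a toy pandas frame (every number below is made up; the pipeline computes this over the joined Spark DataFrame):

```python
import pandas as pd

# Toy rows standing in for the joined dataset; all values are illustrative only.
df = pd.DataFrame({
    "state": ["Alabama", "Alaska"],
    "infant_center_price_2018": [5184.0, 11376.0],
    "median_family_income": [58343.0, 87477.0],
})

# Derived metric: annual infant care price as a percentage of median family income.
df["infant_cost_pct_income"] = (
    df["infant_center_price_2018"] / df["median_family_income"] * 100
).round(1)

print(df[["state", "infant_cost_pct_income"]])
```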

4. Data Storage¶

  • Databases Used:
    • Cassandra:
      • For scalable, high-availability storage of structured data.
      • Supports efficient querying for region-specific and demographic analyses.
    • MongoDB:
      • To handle semi-structured data, such as demographic and childcare details.
      • Allows flexible querying of nested fields for detailed insights.
    • Elasticsearch:
      • Provides fast querying capabilities for text-based data exploration and search-driven analysis.
  • Partitioning:
    • Data is partitioned by region (state, county) and year to optimize storage and retrieval performance.

5. Visualization¶

  • Purpose:
    • To verify data transformations and provide summaries of key metrics.
  • Tools:
    • Matplotlib and Plotly:
      • Create visualizations such as:
        • Geographic distributions of childcare costs as a percentage of income.
        • Correlations between employment rates and childcare costs.
        • Trend lines for historical comparisons.
  • Scope:
    • Minimal visualizations, primarily used to validate the results of the pipeline and illustrate key findings.
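
A minimal sketch of the kind of validation plot intended here, using made-up numbers and matplotlib's non-interactive Agg backend so it renders without a display:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; no display required
import matplotlib.pyplot as plt

# Illustrative numbers only; the real values come from the pipeline output.
states = ["Alabama", "Alaska", "Arizona"]
cost_pct_income = [8.9, 13.0, 11.2]

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(states, cost_pct_income)
ax.set_ylabel("Infant care price (% of median income)")
ax.set_title("Childcare affordability by state (illustrative)")
fig.tight_layout()
fig.savefig("affordability_check.png")
```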

Outcome¶

The project will produce a reliable, scalable data engineering pipeline that ingests, cleans, and transforms datasets into a unified, queryable format. By focusing on data engineering, the project ensures that the data is ready for downstream applications, including policy analysis, detailed reporting, and decision-making. Visualizations serve as supplementary outputs, while the primary deliverable is the robust infrastructure built to process and manage complex datasets.

LIBRARIES¶

In [1]:
pip install boto3
Requirement already satisfied: boto3 in /opt/conda/lib/python3.9/site-packages (1.35.81)
Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /opt/conda/lib/python3.9/site-packages (from boto3) (1.0.1)
Requirement already satisfied: s3transfer<0.11.0,>=0.10.0 in /opt/conda/lib/python3.9/site-packages (from boto3) (0.10.4)
Requirement already satisfied: botocore<1.36.0,>=1.35.81 in /opt/conda/lib/python3.9/site-packages (from boto3) (1.35.81)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /opt/conda/lib/python3.9/site-packages (from botocore<1.36.0,>=1.35.81->boto3) (2.8.2)
Requirement already satisfied: urllib3<1.27,>=1.25.4 in /opt/conda/lib/python3.9/site-packages (from botocore<1.36.0,>=1.35.81->boto3) (1.26.6)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.9/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.36.0,>=1.35.81->boto3) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
In [2]:
pip install openpyxl
Requirement already satisfied: openpyxl in /opt/conda/lib/python3.9/site-packages (3.1.5)
Requirement already satisfied: et-xmlfile in /opt/conda/lib/python3.9/site-packages (from openpyxl) (2.0.0)
Note: you may need to restart the kernel to use updated packages.
In [3]:
pip install -q cassandra-driver
Note: you may need to restart the kernel to use updated packages.
In [4]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import os
import subprocess
from urllib.request import Request, urlopen, urlretrieve  
from urllib.error import HTTPError

# install if needed: pip install boto3 , openpyxl
import boto3
from openpyxl import load_workbook
from pyspark.sql.functions import lit
import pandas as pd
from IPython.display import display
import re
from pyspark.sql.functions import col
import json
from cassandra.cluster import Cluster

USER DEFINED FUNCTIONS¶

In [5]:
# 1. Function to sanitize a single column name
def sanitize_column(name):
    """
    Sanitizes a column name by replacing special characters and standardizing the format.
    - Replaces '%' with '_percent' and '$' with '_dollar'.
    - Converts spaces and special characters into underscores.
    - Strips leading/trailing underscores and converts to lowercase.
    """
    # Replace '%' and '$' with meaningful labels
    name = name.replace('%', '_percent')
    name = name.replace('$', '_dollar')
    # Replace special characters with underscores
    name = re.sub(r'[^\w\s]', '_', name)  # Retain alphanumeric and spaces
    # Replace spaces with underscores and collapse multiple underscores
    name = re.sub(r'\s+', '_', name)
    name = re.sub(r'_+', '_', name)
    # Remove leading/trailing underscores and convert to lowercase
    return name.strip('_').lower()

# 2. Function to apply sanitization to all columns in a DataFrame
def sanitize_dataframe_columns(dataframe):
    """
    Sanitizes all column names in a Pandas DataFrame.
    """
    sanitized_columns = [sanitize_column(col) for col in dataframe.columns]
    dataframe.columns = sanitized_columns
    return dataframe
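
As a quick sanity check, here is how the sanitizer maps real headers from the childcare workbook (the function is copied inline so this cell runs on its own):

```python
import re

def sanitize_column(name):
    # Self-contained copy of the sanitizer defined above.
    name = name.replace('%', '_percent').replace('$', '_dollar')
    name = re.sub(r'[^\w\s]', '_', name)   # non-word characters -> underscore
    name = re.sub(r'\s+', '_', name)       # whitespace -> underscore
    name = re.sub(r'_+', '_', name)        # collapse runs of underscores
    return name.strip('_').lower()

print(sanitize_column("Infant center-based price: 2018 ($)"))
# -> infant_center_based_price_2018_dollar
print(sanitize_column("Median family income ($)"))
# -> median_family_income_dollar
```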

PART I: Data Ingestion¶

The datasets for this project are retrieved directly from publicly available sources, including government and organizational websites. Each dataset is downloaded programmatically or manually (if necessary) from its respective source link. The data files are then stored locally before being uploaded to MinIO, which serves as the centralized storage solution for this project.

MINIO Storage¶

MinIO is used as the landing storage for all data sources.

  • Connect to MinIO from Jupyter Terminal

    image.png

  • Create new bucket for the project

    image.png

  • Verify new bucket

    image.png

  • Download the datasets and upload them to MinIO

In [ ]:
# Define dataset URLs and filenames
datasets = [
    ("https://nces.ed.gov/programs/digest/d22/tables/xls/tabn102.30.xlsx", "tabn102.30.xlsx"),
    ("https://www.dol.gov/sites/dolgov/files/WB/NDCP/Childcare-Prices-by-Age-Children-Care-Setting-Data-Table-2023.xlsx", "Childcare-Prices-2023.xlsx"),
    ("https://www.bls.gov/cps/aa2018/cpsaat08.xlsx", "cpsaat08.xlsx"),
]

# Manual download files
manual_files = [
    ("https://data.census.gov/table/ACSST1Y2018.S2002?t=Income%20and%20Poverty&g=010XX00US,$0400000&y=2018", 
     "ACSST1Y2018.S2002-2024-11-10T090744.csv")
]

# Directory to store downloaded files
download_dir = "datasets"
os.makedirs(download_dir, exist_ok=True)

# Download datasets
print("Downloading datasets...")
for url, filename in datasets:
    local_path = os.path.join(download_dir, filename)
    if not os.path.exists(local_path):
        try:
            # Set headers
            req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})  # Use direct import of Request
            with urlopen(req) as response, open(local_path, 'wb') as out_file:  # Use direct import of urlopen
                out_file.write(response.read())
            print(f"Downloaded: {filename}")
        except HTTPError as e:
            print(f"Failed to download {filename}: {e}. Please download it manually.")
    else:
        print(f"Already exists: {filename}")

# Manual download notification
print("\nManual download required for the following files:")
for _, filename in manual_files:
    print(f"- {filename}. Please place it in {download_dir}.")

# Configure MinIO
minio_alias = "ms"
minio_bucket = "project"

# Set MinIO alias
print("\nConfiguring MinIO...")
subprocess.run(
    ["mc", "alias", "set", minio_alias, "http://minio:9000", "minio", "SU2orange!"],
    check=True
)

# Upload datasets to MinIO
print("\nUploading datasets to MinIO bucket 'project'...")
all_files = datasets + manual_files
for _, filename in all_files:
    local_path = os.path.join(download_dir, filename)
    if os.path.exists(local_path):  # Check if the file exists
        try:
            subprocess.run(
                ["mc", "cp", local_path, f"{minio_alias}/{minio_bucket}/"],
                check=True,
            )
            print(f"Uploaded: {filename}")
        except subprocess.CalledProcessError as e:
            print(f"Failed to upload {filename}: {e}")
    else:
        print(f"File not found: {filename}. Skipping upload.")

print("\nProcess completed!")

Note¶

The following files could not be downloaded programmatically due to access restrictions (HTTP Error 403) and had to be downloaded manually:

  1. cpsaat08.xlsx from the Bureau of Labor Statistics (BLS)
    • URL: BLS CPSAAT08 File
  2. ACSST1Y2018.S2002-2024-11-10T090744.csv from the U.S. Census Bureau
    • URL: Census Data

To ensure proper processing:

  1. Visit the respective URLs and download the files manually.
  2. Move the files into the directory /home/jovyan/datasets.

Once the files are in the correct directory, they will be included in subsequent data uploads to the MinIO bucket.


  • Verify the datasets in the bucket through MinIO webconsole

image.png


Dataset #1: Childcare Prices as a Share of Median Family Income¶

Problem Statement¶

The "Childcare Prices as a Share of Median Family Income" dataset contains detailed information on childcare costs across various U.S. states and counties. However, the dataset requires significant preprocessing to address missing values, ensure consistent column naming, and prepare it for efficient storage and querying. Without these preprocessing steps, the dataset would be prone to errors and challenging to analyze effectively, limiting its usability for decision-making and analytics.

Data Processing¶

  • Handle Missing Values:

    • Identify columns with missing values and replace them with zero to maintain numeric consistency for calculations and analytics.
  • Standardize Column Names:

    • Sanitize column names by removing problematic characters (spaces, special symbols) while retaining meaningful information, such as $ for dollar values and % for percentages.
    • Apply a consistent naming format (e.g., snake_case) for easier querying and analytics.
  • Validate Data Consistency:

    • Group data by states (Reference_SheetName) and ensure proper counts for each category.
    • Ensure completeness and accuracy of data across rows and columns.
  • Prepare Data for Querying:

    • Clean numeric columns to eliminate invalid values.
    • Create an optimized, structured dataset ready for database insertion.
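
The missing-value step above can be sketched with pandas (toy values and hypothetical column names mirroring the sanitized schema):

```python
import numpy as np
import pandas as pd

# Toy frame with a gap, standing in for the sanitized childcare price columns.
df = pd.DataFrame({
    "county_name": ["Autauga County", "Baldwin County"],
    "infant_center_price_2018": [5184.0, np.nan],
})

# Replace missing prices with zero (the convention chosen above) so that
# numeric aggregations do not silently drop rows.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(0)

print(df)
```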

Data Storage¶

Cassandra is chosen as the preferred storage solution due to the following advantages:

  • High Scalability: Cassandra can handle large datasets efficiently, which is ideal for storing multi-state and multi-county data across the U.S.
  • Distributed Data Model: Its distributed architecture ensures quick access and querying, even for geographically distributed analytics.
  • Efficient Querying: The schema can be designed with partition keys (e.g., State, County) to enable fast retrieval of state or county-level information.
  • Write-Optimized: Cassandra is optimized for write-heavy operations, making it suitable for inserting large volumes of cleaned data.
  • Denormalized Data: Cassandra’s table structure aligns well with flattened, denormalized data, where all relevant information can be queried from a single table.
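
A hedged sketch of what the Cassandra schema could look like, with state as the partition key and county as the clustering key so per-state lookups hit a single partition. The keyspace, table, and column names are assumptions, and the `session.execute` calls are commented out so the cell runs without a live cluster:

```python
# Hypothetical CQL schema for the cleaned dataset; all names are illustrative.
KEYSPACE_CQL = """
CREATE KEYSPACE IF NOT EXISTS childcare
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
"""

TABLE_CQL = """
CREATE TABLE IF NOT EXISTS childcare.prices_by_state (
    state_name text,
    county_name text,
    median_family_income_dollar double,
    infant_center_based_price_2018_dollar double,
    PRIMARY KEY ((state_name), county_name)
);
"""

# With a running cluster, the statements would be executed roughly like this
# (commented out so the cell does not require Cassandra):
# from cassandra.cluster import Cluster
# session = Cluster(["cassandra"]).connect()
# session.execute(KEYSPACE_CQL)
# session.execute(TABLE_CQL)
print(TABLE_CQL.strip().splitlines()[0])
```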

Note: Childcare prices are derived from each state's childcare Market Rate Survey. Prices are median yearly prices for one child at the market rate. School-age prices reflect the school-year arrangement (part day). Childcare prices are based on the 2016-2018 data collection cycle and are presented in 2018 and 2023 real dollars using the CPI-U for child care (day care and preschool in the U.S. city average). Economic and demographic data are obtained from the 2014-2018 American Community Survey to correspond to the 2016-2018 childcare price data.


PART I: Ingest Data from MinIO into a PySpark DataFrame¶

The following script automates the process of downloading an Excel file from a MinIO object storage bucket, reading all its sheets, and consolidating them into a single Spark DataFrame. Each sheet is processed individually, with its name added as a Reference_SheetName column to distinguish records from different sheets. The resulting unified DataFrame is well-suited for further analysis and transformation in Spark.

While an optional step to save this data in Parquet format is included—highlighting its benefits such as efficient storage, faster retrieval, and optimized query performance in big data environments—this project will not utilize Parquet files. However, this approach is commonly applied in professional, work-related scenarios where performance and scalability are critical.

In [47]:
# MinIO Configuration
s3_url = "http://minio:9000"
s3_key = "minio"
s3_secret = "SU2orange!"
s3_bucket = "project"
s3_file_key = "Childcare-Prices-2023.xlsx"

# Initialize MinIO client
s3 = boto3.client('s3',
                  endpoint_url=s3_url,
                  aws_access_key_id=s3_key,
                  aws_secret_access_key=s3_secret)

# Download Excel file locally from MinIO
local_file = "/tmp/Childcare-Prices-2023.xlsx"
s3.download_file(s3_bucket, s3_file_key, local_file)

# Read sheet names using pandas
sheet_names = pd.ExcelFile(local_file).sheet_names
print("Sheets in the Excel file:", sheet_names)

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("ExcelSheetsAnalysis") \
    .config("spark.sql.debug.maxToStringFields", "100") \
    .getOrCreate()

# Process each sheet and add to a unified DataFrame
final_df = None

for sheet in sheet_names:
    try:
        print(f"Loading sheet: {sheet}")
        # Read the sheet using pandas
        pandas_df = pd.read_excel(local_file, sheet_name=sheet)
        
        # Add the Reference_SheetName column to record the source sheet
        pandas_df['Reference_SheetName'] = sheet
        
        # Convert pandas DataFrame to Spark DataFrame
        spark_df = spark.createDataFrame(pandas_df)
        
        # Combine DataFrames
        if final_df is None:
            final_df = spark_df
        else:
            final_df = final_df.unionByName(spark_df)
        
        print(f"Loaded sheet '{sheet}' with {spark_df.count()} rows and {len(spark_df.columns)} columns.")
    except Exception as e:
        print(f"Failed to load sheet '{sheet}': {e}")

# Show the combined DataFrame
if final_df:
    print("Combined DataFrame:")
    #final_df.show()
    final_df.printSchema()
else:
    print("No data was successfully loaded.")
Sheets in the Excel file: ['AllStates', 'Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'DistrictOfColumbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'NewHampshire', 'NewJersey', 'NewMexico', 'NewYork', 'NorthCarolina', 'NorthDakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'RhodeIsland', 'SouthCarolina', 'SouthDakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'WestVirginia', 'Wisconsin', 'Wyoming']
Loading sheet: AllStates
Loaded sheet 'AllStates' with 3142 rows and 37 columns.
Loading sheet: Alabama
Loaded sheet 'Alabama' with 67 rows and 37 columns.
Loading sheet: Alaska
Loaded sheet 'Alaska' with 29 rows and 37 columns.
Loading sheet: Arizona
Loaded sheet 'Arizona' with 15 rows and 37 columns.
Loading sheet: Arkansas
Loaded sheet 'Arkansas' with 75 rows and 37 columns.
Loading sheet: California
Loaded sheet 'California' with 58 rows and 37 columns.
Loading sheet: Colorado
Loaded sheet 'Colorado' with 64 rows and 37 columns.
Loading sheet: Connecticut
Loaded sheet 'Connecticut' with 8 rows and 37 columns.
Loading sheet: Delaware
Loaded sheet 'Delaware' with 3 rows and 37 columns.
Loading sheet: DistrictOfColumbia
Loaded sheet 'DistrictOfColumbia' with 1 rows and 37 columns.
Loading sheet: Florida
Loaded sheet 'Florida' with 67 rows and 37 columns.
Loading sheet: Georgia
Loaded sheet 'Georgia' with 159 rows and 37 columns.
Loading sheet: Hawaii
Loaded sheet 'Hawaii' with 5 rows and 37 columns.
Loading sheet: Idaho
Loaded sheet 'Idaho' with 44 rows and 37 columns.
Loading sheet: Illinois
Loaded sheet 'Illinois' with 102 rows and 37 columns.
Loading sheet: Indiana
Loaded sheet 'Indiana' with 92 rows and 37 columns.
Loading sheet: Iowa
Loaded sheet 'Iowa' with 99 rows and 37 columns.
Loading sheet: Kansas
Loaded sheet 'Kansas' with 105 rows and 37 columns.
Loading sheet: Kentucky
Loaded sheet 'Kentucky' with 120 rows and 37 columns.
Loading sheet: Louisiana
Loaded sheet 'Louisiana' with 64 rows and 37 columns.
Loading sheet: Maine
Loaded sheet 'Maine' with 16 rows and 37 columns.
Loading sheet: Maryland
Loaded sheet 'Maryland' with 24 rows and 37 columns.
Loading sheet: Massachusetts
Loaded sheet 'Massachusetts' with 14 rows and 37 columns.
Loading sheet: Michigan
Loaded sheet 'Michigan' with 83 rows and 37 columns.
Loading sheet: Minnesota
Loaded sheet 'Minnesota' with 87 rows and 37 columns.
Loading sheet: Mississippi
Loaded sheet 'Mississippi' with 82 rows and 37 columns.
Loading sheet: Missouri
Loaded sheet 'Missouri' with 115 rows and 37 columns.
Loading sheet: Montana
Loaded sheet 'Montana' with 56 rows and 37 columns.
Loading sheet: Nebraska
Loaded sheet 'Nebraska' with 93 rows and 37 columns.
Loading sheet: Nevada
Loaded sheet 'Nevada' with 17 rows and 37 columns.
Loading sheet: NewHampshire
Loaded sheet 'NewHampshire' with 10 rows and 37 columns.
Loading sheet: NewJersey
Loaded sheet 'NewJersey' with 21 rows and 37 columns.
Loading sheet: NewMexico
Loaded sheet 'NewMexico' with 33 rows and 37 columns.
Loading sheet: NewYork
Loaded sheet 'NewYork' with 62 rows and 37 columns.
Loading sheet: NorthCarolina
Loaded sheet 'NorthCarolina' with 100 rows and 37 columns.
Loading sheet: NorthDakota
Loaded sheet 'NorthDakota' with 53 rows and 37 columns.
Loading sheet: Ohio
Loaded sheet 'Ohio' with 88 rows and 37 columns.
Loading sheet: Oklahoma
Loaded sheet 'Oklahoma' with 77 rows and 37 columns.
Loading sheet: Oregon
Loaded sheet 'Oregon' with 36 rows and 37 columns.
Loading sheet: Pennsylvania
Loaded sheet 'Pennsylvania' with 67 rows and 37 columns.
Loading sheet: RhodeIsland
Loaded sheet 'RhodeIsland' with 5 rows and 37 columns.
Loading sheet: SouthCarolina
Loaded sheet 'SouthCarolina' with 46 rows and 37 columns.
Loading sheet: SouthDakota
Loaded sheet 'SouthDakota' with 66 rows and 37 columns.
Loading sheet: Tennessee
Loaded sheet 'Tennessee' with 95 rows and 37 columns.
Loading sheet: Texas
Loaded sheet 'Texas' with 254 rows and 37 columns.
Loading sheet: Utah
Loaded sheet 'Utah' with 29 rows and 37 columns.
Loading sheet: Vermont
Loaded sheet 'Vermont' with 14 rows and 37 columns.
Loading sheet: Virginia
Loaded sheet 'Virginia' with 133 rows and 37 columns.
Loading sheet: Washington
Loaded sheet 'Washington' with 39 rows and 37 columns.
Loading sheet: WestVirginia
Loaded sheet 'WestVirginia' with 55 rows and 37 columns.
Loading sheet: Wisconsin
Loaded sheet 'Wisconsin' with 72 rows and 37 columns.
Loading sheet: Wyoming
Loaded sheet 'Wyoming' with 23 rows and 37 columns.
Combined DataFrame:
root
 |-- State name: string (nullable = true)
 |-- County name: string (nullable = true)
 |-- County FIPS code: long (nullable = true)
 |-- Infant center-based price: 2018 ($): double (nullable = true)
 |-- Infant center-based price: 2023 (estimated) ($): double (nullable = true)
 |-- Infant center-based price as share of family income: 2018 (%): double (nullable = true)
 |-- Infant home-based price: 2018 ($): double (nullable = true)
 |-- Infant home-based price: 2023 (estimated) ($): double (nullable = true)
 |-- Infant home-based price as share of family income: 2018 (%): double (nullable = true)
 |-- Toddler center-based price: 2018 ($): double (nullable = true)
 |-- Toddler center-based price: 2023 (estimated) ($): double (nullable = true)
 |-- Toddler center-based price as share of family income: 2018 (%): double (nullable = true)
 |-- Toddler home-based price: 2018 ($): double (nullable = true)
 |-- Toddler home-based price: 2023 (estimated) ($): double (nullable = true)
 |-- Toddler home-based price as share of family income: 2018 (%): double (nullable = true)
 |-- Preschool center-based price: 2018 ($): double (nullable = true)
 |-- Preschool center-based price: 2023 (estimated) ($): double (nullable = true)
 |-- Preschool center-based price as share of family income: 2018 (%): double (nullable = true)
 |-- Preschool home-based price: 2018 ($): double (nullable = true)
 |-- Preschool home-based price: 2023 (estimated) ($): double (nullable = true)
 |-- Preschool home-based price as share of family income: 2018 (%): double (nullable = true)
 |-- School-age center-based price: 2018 ($): double (nullable = true)
 |-- School-age center-based price: 2023 (estimated) ($): double (nullable = true)
 |-- School-age center-based price as share of family income: 2018 (%): double (nullable = true)
 |-- School-age home-based price: 2018 ($): double (nullable = true)
 |-- School-age home-based price:  2023 (estimated) ($): double (nullable = true)
 |-- School-age home-based price as share of family income: 2018 (%): double (nullable = true)
 |-- Women's labor force participation rate (%): double (nullable = true)
 |-- Women's median earnings ($): double (nullable = true)
 |-- Median family income ($): double (nullable = true)
 |-- Percent of families in poverty (%): double (nullable = true)
 |-- Total population: long (nullable = true)
 |-- Percent Asian (%): double (nullable = true)
 |-- Percent Black (%): double (nullable = true)
 |-- Percent Hispanic (of any race) (%): double (nullable = true)
 |-- Percent White (%): double (nullable = true)
 |-- Reference_SheetName: string (nullable = true)

In [ ]:
# Converting dataframe to Pandas
childcare_df = final_df.toPandas()

PART II: Dataset Preprocessing, Transformation, and Storage¶

This part covers cleaning the data and handling missing values. We'll maintain data quality and integrity to ensure reliable analysis.

Description: This code performs several essential data exploration and preprocessing tasks on the DataFrame.

Purpose: To gather comprehensive information about the DataFrame's structure, statistical summary, and data quality, which is crucial for effective data analysis and preprocessing.

Get Shape and Unique Values¶

In [50]:
# Check unique values in the Reference_SheetName
print(childcare_df['Reference_SheetName'].unique())
print(" ")

# Group by 'Reference_SheetName' and count rows
row_counts = childcare_df.groupby('Reference_SheetName').size().reset_index(name='Total_Rows')

# Sort the results by 'Total_Rows' in descending order
row_counts = row_counts.sort_values(by='Total_Rows', ascending=False)

# Display the result
print(row_counts)

# Get Data Frame shape
#print(f"DataFrame shape: {childcare_df.shape}")
print(" ")

# Display DataFrame info
childcare_df.info()
['AllStates' 'Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California'
 'Colorado' 'Connecticut' 'Delaware' 'DistrictOfColumbia' 'Florida'
 'Georgia' 'Hawaii' 'Idaho' 'Illinois' 'Indiana' 'Iowa' 'Kansas'
 'Kentucky' 'Louisiana' 'Maine' 'Maryland' 'Massachusetts' 'Michigan'
 'Minnesota' 'Mississippi' 'Missouri' 'Montana' 'Nebraska' 'Nevada'
 'NewHampshire' 'NewJersey' 'NewMexico' 'NewYork' 'NorthCarolina'
 'NorthDakota' 'Ohio' 'Oklahoma' 'Oregon' 'Pennsylvania' 'RhodeIsland'
 'SouthCarolina' 'SouthDakota' 'Tennessee' 'Texas' 'Utah' 'Vermont'
 'Virginia' 'Washington' 'WestVirginia' 'Wisconsin' 'Wyoming']
 
   Reference_SheetName  Total_Rows
2            AllStates        3142
44               Texas         254
11             Georgia         159
47            Virginia         133
18            Kentucky         120
26            Missouri         115
17              Kansas         105
14            Illinois         102
34       NorthCarolina         100
16                Iowa          99
43           Tennessee          95
28            Nebraska          93
15             Indiana          92
36                Ohio          88
24           Minnesota          87
23            Michigan          83
25         Mississippi          82
37            Oklahoma          77
4             Arkansas          75
50           Wisconsin          72
39        Pennsylvania          67
0              Alabama          67
10             Florida          67
42         SouthDakota          66
6             Colorado          64
19           Louisiana          64
33             NewYork          62
5           California          58
27             Montana          56
49        WestVirginia          55
35         NorthDakota          53
41       SouthCarolina          46
13               Idaho          44
48          Washington          39
38              Oregon          36
32           NewMexico          33
1               Alaska          29
45                Utah          29
21            Maryland          24
51             Wyoming          23
31           NewJersey          21
29              Nevada          17
20               Maine          16
3              Arizona          15
46             Vermont          14
22       Massachusetts          14
30        NewHampshire          10
7          Connecticut           8
40         RhodeIsland           5
12              Hawaii           5
8             Delaware           3
9   DistrictOfColumbia           1
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6284 entries, 0 to 6283
Data columns (total 37 columns):
 #   Column                                                             Non-Null Count  Dtype  
---  ------                                                             --------------  -----  
 0   State name                                                         6284 non-null   object 
 1   County name                                                        6284 non-null   object 
 2   County FIPS code                                                   6284 non-null   int64  
 3   Infant center-based price: 2018 ($)                                5636 non-null   float64
 4   Infant center-based price: 2023 (estimated) ($)                    5636 non-null   float64
 5   Infant center-based price as share of family income: 2018 (%)      5636 non-null   float64
 6   Infant home-based price: 2018 ($)                                  5596 non-null   float64
 7   Infant home-based price: 2023 (estimated) ($)                      5596 non-null   float64
 8   Infant home-based price as share of family income: 2018 (%)        5596 non-null   float64
 9   Toddler center-based price: 2018 ($)                               5636 non-null   float64
 10  Toddler center-based price: 2023 (estimated) ($)                   5636 non-null   float64
 11  Toddler center-based price as share of family income: 2018 (%)     5636 non-null   float64
 12  Toddler home-based price: 2018 ($)                                 5596 non-null   float64
 13  Toddler home-based price: 2023 (estimated) ($)                     5596 non-null   float64
 14  Toddler home-based price as share of family income: 2018 (%)       5596 non-null   float64
 15  Preschool center-based price: 2018 ($)                             5636 non-null   float64
 16  Preschool center-based price: 2023 (estimated) ($)                 5636 non-null   float64
 17  Preschool center-based price as share of family income: 2018 (%)   5636 non-null   float64
 18  Preschool home-based price: 2018 ($)                               5596 non-null   float64
 19  Preschool home-based price: 2023 (estimated) ($)                   5596 non-null   float64
 20  Preschool home-based price as share of family income: 2018 (%)     5596 non-null   float64
 21  School-age center-based price: 2018 ($)                            5636 non-null   float64
 22  School-age center-based price: 2023 (estimated) ($)                5636 non-null   float64
 23  School-age center-based price as share of family income: 2018 (%)  5636 non-null   float64
 24  School-age home-based price: 2018 ($)                              5596 non-null   float64
 25  School-age home-based price:  2023 (estimated) ($)                 5596 non-null   float64
 26  School-age home-based price as share of family income: 2018 (%)    5596 non-null   float64
 27  Women's labor force participation rate (%)                         6284 non-null   float64
 28  Women's median earnings ($)                                        6284 non-null   float64
 29  Median family income ($)                                           6284 non-null   float64
 30  Percent of families in poverty (%)                                 6284 non-null   float64
 31  Total population                                                   6284 non-null   int64  
 32  Percent Asian (%)                                                  6284 non-null   float64
 33  Percent Black (%)                                                  6284 non-null   float64
 34  Percent Hispanic (of any race) (%)                                 6284 non-null   float64
 35  Percent White (%)                                                  6284 non-null   float64
 36  Reference_SheetName                                                6284 non-null   object 
dtypes: float64(32), int64(2), object(3)
memory usage: 1.8+ MB

Get summary statistics¶

In [51]:
# Get summary statistics for the DataFrame
print("\nSummary Statistics:")
childcare_df.describe()
Summary Statistics:
Out[51]:
County FIPS code Infant center-based price: 2018 ($) Infant center-based price: 2023 (estimated) ($) Infant center-based price as share of family income: 2018 (%) Infant home-based price: 2018 ($) Infant home-based price: 2023 (estimated) ($) Infant home-based price as share of family income: 2018 (%) Toddler center-based price: 2018 ($) Toddler center-based price: 2023 (estimated) ($) Toddler center-based price as share of family income: 2018 (%) Toddler home-based price: 2018 ($) Toddler home-based price: 2023 (estimated) ($) Toddler home-based price as share of family income: 2018 (%) Preschool center-based price: 2018 ($) Preschool center-based price: 2023 (estimated) ($) Preschool center-based price as share of family income: 2018 (%) Preschool home-based price: 2018 ($) Preschool home-based price: 2023 (estimated) ($) Preschool home-based price as share of family income: 2018 (%) School-age center-based price: 2018 ($) School-age center-based price: 2023 (estimated) ($) School-age center-based price as share of family income: 2018 (%) School-age home-based price: 2018 ($) School-age home-based price: 2023 (estimated) ($) School-age home-based price as share of family income: 2018 (%) Women's labor force participation rate (%) Women's median earnings ($) Median family income ($) Percent of families in poverty (%) Total population Percent Asian (%) Percent Black (%) Percent Hispanic (of any race) (%) Percent White (%)
count 6284.000000 5636.000000 5636.000000 5636.000000 5596.000000 5596.000000 5596.000000 5636.000000 5636.000000 5636.000000 5596.000000 5596.000000 5596.000000 5636.000000 5636.000000 5636.000000 5596.000000 5596.000000 5596.000000 5636.000000 5636.000000 5636.000000 5596.000000 5596.000000 5596.000000 6284.000000 6284.000000 6284.000000 6284.000000 6.284000e+03 6284.000000 6284.000000 6284.000000 6284.000000
mean 30383.649268 8241.184469 9700.303909 12.910712 6390.337971 7521.761058 10.145962 7454.270260 8774.064847 11.723498 6042.372316 7112.187334 9.608443 6943.904437 8173.337657 10.950372 5886.502211 6928.720092 9.358120 5817.488021 6847.486791 9.219063 5252.829916 6182.854754 8.384417 69.505952 24485.186139 64062.194213 11.238447 1.027699e+05 1.373456 9.059771 9.262826 83.055761
std 15161.301696 2880.559603 3390.568878 3.209779 1828.062093 2151.724419 2.322280 2363.327871 2781.760155 2.655799 1654.629996 1947.585797 2.132269 2121.796762 2497.465443 2.455602 1616.341033 1902.517690 2.094380 1961.018409 2308.220937 2.576315 1553.794978 1828.897722 2.203071 7.812962 4793.042660 15863.117635 5.664165 3.298815e+05 2.814414 14.529875 13.789355 16.869443
min 1001.000000 1441.960000 1697.262120 2.422200 2478.320000 2917.111888 4.608798 1120.080000 1318.392574 1.881507 2442.440000 2874.879257 4.228521 1120.080000 1318.392574 1.881507 2081.560000 2450.104676 3.797497 986.960000 1161.703391 1.657893 1721.200000 2025.942163 2.723350 33.600000 7336.000000 21816.000000 0.000000 7.500000e+01 0.000000 0.000000 0.000000 3.900000
25% 18177.000000 6487.000000 7635.537306 10.810312 5105.333220 6009.243450 8.627784 5982.599917 7041.832103 10.001134 4827.414260 5682.118341 8.129308 5509.400000 6484.851123 9.327111 4722.640000 5558.793572 7.947756 4622.918118 5441.415716 7.572226 4290.000000 5049.553730 6.960867 64.400000 21614.000000 53547.000000 7.300000 1.094800e+04 0.300000 0.700000 2.100000 76.600000
50% 29176.000000 7800.000000 9181.006781 12.704673 5999.760000 7062.030416 9.939216 7163.000000 8431.224561 11.511692 5720.000000 6732.738306 9.470798 6500.000000 7650.838984 10.777722 5634.100688 6631.630336 9.250211 5460.006855 6426.712815 9.189917 4907.256732 5776.097094 8.261921 70.000000 23926.000000 62128.000000 10.300000 2.573600e+04 0.600000 2.300000 4.100000 89.650000
75% 45081.000000 9360.000000 11017.208137 14.608065 7331.480000 8629.534307 11.424713 8320.000000 9793.073900 13.227062 6795.880000 7999.105175 10.795409 7852.000000 9242.213493 12.299005 6614.691406 7785.836750 10.542621 6943.560000 8172.932237 10.844965 6240.000000 7344.805425 9.689398 75.100000 26582.000000 71366.000000 13.900000 6.725300e+04 1.300000 10.200000 9.600000 95.100000
max 56045.000000 24440.000000 28767.154581 47.003046 22408.880000 26376.420415 23.361513 21788.000000 25645.612275 36.467839 19568.640000 23033.309812 22.539479 20020.000000 23564.584072 33.457612 17229.680000 20280.231912 24.428166 19520.800000 22976.999638 27.322515 15060.240000 17726.687893 19.576390 95.200000 62919.000000 178542.000000 52.100000 1.009805e+07 42.500000 87.400000 99.100000 100.000000

Check Null values¶

In [52]:
# Check for missing values in the DataFrame
print("\nMissing Values:")
childcare_df.isnull().sum()
Missing Values:
Out[52]:
State name                                                             0
County name                                                            0
County FIPS code                                                       0
Infant center-based price: 2018 ($)                                  648
Infant center-based price: 2023 (estimated) ($)                      648
Infant center-based price as share of family income: 2018 (%)        648
Infant home-based price: 2018 ($)                                    688
Infant home-based price: 2023 (estimated) ($)                        688
Infant home-based price as share of family income: 2018 (%)          688
Toddler center-based price: 2018 ($)                                 648
Toddler center-based price: 2023 (estimated) ($)                     648
Toddler center-based price as share of family income: 2018 (%)       648
Toddler home-based price: 2018 ($)                                   688
Toddler home-based price: 2023 (estimated) ($)                       688
Toddler home-based price as share of family income: 2018 (%)         688
Preschool center-based price: 2018 ($)                               648
Preschool center-based price: 2023 (estimated) ($)                   648
Preschool center-based price as share of family income: 2018 (%)     648
Preschool home-based price: 2018 ($)                                 688
Preschool home-based price: 2023 (estimated) ($)                     688
Preschool home-based price as share of family income: 2018 (%)       688
School-age center-based price: 2018 ($)                              648
School-age center-based price: 2023 (estimated) ($)                  648
School-age center-based price as share of family income: 2018 (%)    648
School-age home-based price: 2018 ($)                                688
School-age home-based price:  2023 (estimated) ($)                   688
School-age home-based price as share of family income: 2018 (%)      688
Women's labor force participation rate (%)                             0
Women's median earnings ($)                                            0
Median family income ($)                                               0
Percent of families in poverty (%)                                     0
Total population                                                       0
Percent Asian (%)                                                      0
Percent Black (%)                                                      0
Percent Hispanic (of any race) (%)                                     0
Percent White (%)                                                      0
Reference_SheetName                                                    0
dtype: int64
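The per-column totals fall into exactly two values: 648 for every center-based column and 688 for every home-based column, which suggests whole blocks of price columns are missing together for the same counties. A minimal sketch of how such block patterns can be spotted (toy data, not the real frame):

```python
import pandas as pd

# Hypothetical miniature of the pattern above: center-based columns are
# missing together, and home-based columns are missing together.
df = pd.DataFrame({
    "center_2018": [1.0, None, 3.0, None],
    "center_2023": [1.5, None, 3.5, None],
    "home_2018":   [2.0, None, None, None],
    "home_2023":   [2.5, None, None, None],
})

# Distinct per-column null totals reveal the blocks: columns that are
# always missing together share the same count.
null_counts = df.isnull().sum()
pattern = null_counts.value_counts().to_dict()
print(pattern)  # {2: 2, 3: 2} -> two columns with 2 nulls, two with 3
```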
In [53]:
# Check for missing values in the Pandas DataFrame
missing_values = childcare_df.isnull().sum()

# Filter columns with null values
columns_with_nulls = missing_values[missing_values > 0].index.tolist()

# Check for null values grouped by 'Reference_SheetName'
null_summary = childcare_df.groupby('Reference_SheetName').apply(
    lambda group: group[columns_with_nulls].isnull().sum()
)

# Filter only rows where null values exist
null_summary = null_summary.loc[:, (null_summary > 0).any()]

# Set Pandas display options to show all rows
pd.set_option('display.max_rows', len(null_summary))
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

# Display the DataFrame as a table 
display(null_summary)

# Optionally reset display options
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
pd.reset_option('display.width')
Infant center-based price: 2018 ($) Infant center-based price: 2023 (estimated) ($) Infant center-based price as share of family income: 2018 (%) Infant home-based price: 2018 ($) Infant home-based price: 2023 (estimated) ($) Infant home-based price as share of family income: 2018 (%) Toddler center-based price: 2018 ($) Toddler center-based price: 2023 (estimated) ($) Toddler center-based price as share of family income: 2018 (%) Toddler home-based price: 2018 ($) Toddler home-based price: 2023 (estimated) ($) Toddler home-based price as share of family income: 2018 (%) Preschool center-based price: 2018 ($) Preschool center-based price: 2023 (estimated) ($) Preschool center-based price as share of family income: 2018 (%) Preschool home-based price: 2018 ($) Preschool home-based price: 2023 (estimated) ($) Preschool home-based price as share of family income: 2018 (%) School-age center-based price: 2018 ($) School-age center-based price: 2023 (estimated) ($) School-age center-based price as share of family income: 2018 (%) School-age home-based price: 2018 ($) School-age home-based price: 2023 (estimated) ($) School-age home-based price as share of family income: 2018 (%)
Reference_SheetName
Alabama 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Alaska 22 22 22 28 28 28 22 22 22 28 28 28 22 22 22 28 28 28 22 22 22 28 28 28
AllStates 324 324 324 344 344 344 324 324 324 344 344 344 324 324 324 344 344 344 324 324 324 344 344 344
Arizona 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Arkansas 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
California 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Colorado 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
Connecticut 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Delaware 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
DistrictOfColumbia 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Florida 2 2 2 3 3 3 2 2 2 3 3 3 2 2 2 3 3 3 2 2 2 3 3 3
Georgia 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Hawaii 4 4 4 1 1 1 4 4 4 1 1 1 4 4 4 1 1 1 4 4 4 1 1 1
Idaho 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Illinois 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Indiana 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92 92
Iowa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Kansas 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Kentucky 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Louisiana 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Maine 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Maryland 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Massachusetts 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Michigan 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Minnesota 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mississippi 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Missouri 100 100 100 110 110 110 100 100 100 110 110 110 100 100 100 110 110 110 100 100 100 110 110 110
Montana 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Nebraska 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Nevada 2 2 2 9 9 9 2 2 2 9 9 9 2 2 2 9 9 9 2 2 2 9 9 9
NewHampshire 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
NewJersey 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
NewMexico 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33
NewYork 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
NorthCarolina 1 1 1 2 2 2 1 1 1 2 2 2 1 1 1 2 2 2 1 1 1 2 2 2
NorthDakota 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Ohio 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Oklahoma 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Oregon 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Pennsylvania 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
RhodeIsland 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
SouthCarolina 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
SouthDakota 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Tennessee 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Texas 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Utah 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Vermont 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Virginia 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Washington 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
WestVirginia 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Wisconsin 3 3 3 1 1 1 3 3 3 1 1 1 3 3 3 1 1 1 3 3 3 1 1 1
Wyoming 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
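The same grouped null summary can be computed without `groupby().apply()`, which recent pandas versions flag with a FutureWarning about operating on the grouping columns. A minimal sketch on a toy frame (hypothetical values): mask the nulls first, then aggregate by sheet name.

```python
import pandas as pd

# Toy stand-in for childcare_df (hypothetical values)
df = pd.DataFrame({
    "Reference_SheetName": ["Alaska", "Alaska", "Texas"],
    "Infant center-based price: 2018 ($)": [None, 7800.0, 6500.0],
})

cols_with_nulls = df.columns[df.isnull().any()].tolist()

# Boolean-mask the null positions, then group the mask by sheet name and
# sum -- no .apply() over the grouped frame is needed.
null_summary = df[cols_with_nulls].isnull().groupby(df["Reference_SheetName"]).sum()
print(null_summary)
```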

Handling Missing Values¶

This step creates a clean copy of the childcare dataset in which every missing value is replaced with 0. Filling the nulls in the numerical columns keeps downstream calculations and visualizations from failing on NaN values. The original dataset remains unchanged; a new DataFrame is created for further use.

In [54]:
# Create a new DataFrame to preserve the original data
cleaned_dataframe = childcare_df.copy()

# Identify every column that contains missing values
columns_with_nulls = cleaned_dataframe.columns[cleaned_dataframe.isnull().any()]

# Replace NaN with 0 in those columns; assigning the filled result back avoids
# the chained-assignment pitfall that `fillna(..., inplace=True)` on a column
# view can trigger in recent pandas versions
cleaned_dataframe[columns_with_nulls] = cleaned_dataframe[columns_with_nulls].fillna(0)

# Display the updated DataFrame and confirm no nulls exist
print("Updated DataFrame with missing values replaced:")
print(cleaned_dataframe.isnull().sum())  # Check if all nulls are replaced
Updated DataFrame with missing values replaced:
State name                                                           0
County name                                                          0
County FIPS code                                                     0
Infant center-based price: 2018 ($)                                  0
Infant center-based price: 2023 (estimated) ($)                      0
Infant center-based price as share of family income: 2018 (%)        0
Infant home-based price: 2018 ($)                                    0
Infant home-based price: 2023 (estimated) ($)                        0
Infant home-based price as share of family income: 2018 (%)          0
Toddler center-based price: 2018 ($)                                 0
Toddler center-based price: 2023 (estimated) ($)                     0
Toddler center-based price as share of family income: 2018 (%)       0
Toddler home-based price: 2018 ($)                                   0
Toddler home-based price: 2023 (estimated) ($)                       0
Toddler home-based price as share of family income: 2018 (%)         0
Preschool center-based price: 2018 ($)                               0
Preschool center-based price: 2023 (estimated) ($)                   0
Preschool center-based price as share of family income: 2018 (%)     0
Preschool home-based price: 2018 ($)                                 0
Preschool home-based price: 2023 (estimated) ($)                     0
Preschool home-based price as share of family income: 2018 (%)       0
School-age center-based price: 2018 ($)                              0
School-age center-based price: 2023 (estimated) ($)                  0
School-age center-based price as share of family income: 2018 (%)    0
School-age home-based price: 2018 ($)                                0
School-age home-based price:  2023 (estimated) ($)                   0
School-age home-based price as share of family income: 2018 (%)      0
Women's labor force participation rate (%)                           0
Women's median earnings ($)                                          0
Median family income ($)                                             0
Percent of families in poverty (%)                                   0
Total population                                                     0
Percent Asian (%)                                                    0
Percent Black (%)                                                    0
Percent Hispanic (of any race) (%)                                   0
Percent White (%)                                                    0
Reference_SheetName                                                  0
dtype: int64

Standardizing Column Names for Consistent and Readable Data¶

Standardizing column names is a crucial preprocessing step: it makes names descriptive, programmatically friendly, and consistent across the dataset. The process removes problematic characters, applies a uniform snake_case format, and preserves key semantics by encoding dollar signs ($) as `_dollar` and percent signs (%) as `_percent`.

Standardized column names:

  • Improve programmatic compatibility, making it easier to reference columns in scripts and queries.
  • Enhance consistency, reducing potential errors caused by variations in naming conventions.
  • Retain semantic meaning, keeping column names informative (e.g., encoding $ as `_dollar` and % as `_percent`).

The `sanitize_dataframe_columns` helper automates this sanitization, transforming column names into a uniform, clear format while preserving the financial and percentage semantics of the data.
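The actual helper is defined earlier in the notebook; a minimal sketch of what such a function might look like, reproducing the naming pattern seen in the sanitized output (the function bodies below are an assumption, not the notebook's exact implementation):

```python
import re

import pandas as pd


def sanitize_column_name(name: str) -> str:
    """Lower-case a column name, encode $ / % as words, and collapse any
    remaining run of non-alphanumeric characters into a single underscore."""
    name = name.strip().lower()
    name = name.replace("$", " dollar").replace("%", " percent")
    name = re.sub(r"[^0-9a-z]+", "_", name)
    return name.strip("_")


def sanitize_dataframe_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with sanitized, snake_case column names."""
    return df.rename(columns={c: sanitize_column_name(c) for c in df.columns})
```

For example, `"Infant center-based price: 2018 ($)"` becomes `infant_center_based_price_2018_dollar`, and the apostrophe in `"Women's labor force participation rate (%)"` collapses to an underscore, yielding `women_s_labor_force_participation_rate_percent`.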

In [55]:
# Apply the function on the `cleaned_dataframe`
cleaned_dataframe = sanitize_dataframe_columns(cleaned_dataframe)

# Display sanitized column names
print("Sanitized Column Names:")
print(cleaned_dataframe.columns)
Sanitized Column Names:
Index(['state_name', 'county_name', 'county_fips_code',
       'infant_center_based_price_2018_dollar',
       'infant_center_based_price_2023_estimated_dollar',
       'infant_center_based_price_as_share_of_family_income_2018_percent',
       'infant_home_based_price_2018_dollar',
       'infant_home_based_price_2023_estimated_dollar',
       'infant_home_based_price_as_share_of_family_income_2018_percent',
       'toddler_center_based_price_2018_dollar',
       'toddler_center_based_price_2023_estimated_dollar',
       'toddler_center_based_price_as_share_of_family_income_2018_percent',
       'toddler_home_based_price_2018_dollar',
       'toddler_home_based_price_2023_estimated_dollar',
       'toddler_home_based_price_as_share_of_family_income_2018_percent',
       'preschool_center_based_price_2018_dollar',
       'preschool_center_based_price_2023_estimated_dollar',
       'preschool_center_based_price_as_share_of_family_income_2018_percent',
       'preschool_home_based_price_2018_dollar',
       'preschool_home_based_price_2023_estimated_dollar',
       'preschool_home_based_price_as_share_of_family_income_2018_percent',
       'school_age_center_based_price_2018_dollar',
       'school_age_center_based_price_2023_estimated_dollar',
       'school_age_center_based_price_as_share_of_family_income_2018_percent',
       'school_age_home_based_price_2018_dollar',
       'school_age_home_based_price_2023_estimated_dollar',
       'school_age_home_based_price_as_share_of_family_income_2018_percent',
       'women_s_labor_force_participation_rate_percent',
       'women_s_median_earnings_dollar', 'median_family_income_dollar',
       'percent_of_families_in_poverty_percent', 'total_population',
       'percent_asian_percent', 'percent_black_percent',
       'percent_hispanic_of_any_race_percent', 'percent_white_percent',
       'reference_sheetname'],
      dtype='object')
In [56]:
print(cleaned_dataframe.dtypes)

# Convert Pandas DataFrame to PySpark DataFrame
spark_df = spark.createDataFrame(cleaned_dataframe)

# Print the schema
spark_df.printSchema()
state_name                                                               object
county_name                                                              object
county_fips_code                                                          int64
infant_center_based_price_2018_dollar                                   float64
infant_center_based_price_2023_estimated_dollar                         float64
infant_center_based_price_as_share_of_family_income_2018_percent        float64
infant_home_based_price_2018_dollar                                     float64
infant_home_based_price_2023_estimated_dollar                           float64
infant_home_based_price_as_share_of_family_income_2018_percent          float64
toddler_center_based_price_2018_dollar                                  float64
toddler_center_based_price_2023_estimated_dollar                        float64
toddler_center_based_price_as_share_of_family_income_2018_percent       float64
toddler_home_based_price_2018_dollar                                    float64
toddler_home_based_price_2023_estimated_dollar                          float64
toddler_home_based_price_as_share_of_family_income_2018_percent         float64
preschool_center_based_price_2018_dollar                                float64
preschool_center_based_price_2023_estimated_dollar                      float64
preschool_center_based_price_as_share_of_family_income_2018_percent     float64
preschool_home_based_price_2018_dollar                                  float64
preschool_home_based_price_2023_estimated_dollar                        float64
preschool_home_based_price_as_share_of_family_income_2018_percent       float64
school_age_center_based_price_2018_dollar                               float64
school_age_center_based_price_2023_estimated_dollar                     float64
school_age_center_based_price_as_share_of_family_income_2018_percent    float64
school_age_home_based_price_2018_dollar                                 float64
school_age_home_based_price_2023_estimated_dollar                       float64
school_age_home_based_price_as_share_of_family_income_2018_percent      float64
women_s_labor_force_participation_rate_percent                          float64
women_s_median_earnings_dollar                                          float64
median_family_income_dollar                                             float64
percent_of_families_in_poverty_percent                                  float64
total_population                                                          int64
percent_asian_percent                                                   float64
percent_black_percent                                                   float64
percent_hispanic_of_any_race_percent                                    float64
percent_white_percent                                                   float64
reference_sheetname                                                      object
dtype: object
root
 |-- state_name: string (nullable = true)
 |-- county_name: string (nullable = true)
 |-- county_fips_code: long (nullable = true)
 |-- infant_center_based_price_2018_dollar: double (nullable = true)
 |-- infant_center_based_price_2023_estimated_dollar: double (nullable = true)
 |-- infant_center_based_price_as_share_of_family_income_2018_percent: double (nullable = true)
 |-- infant_home_based_price_2018_dollar: double (nullable = true)
 |-- infant_home_based_price_2023_estimated_dollar: double (nullable = true)
 |-- infant_home_based_price_as_share_of_family_income_2018_percent: double (nullable = true)
 |-- toddler_center_based_price_2018_dollar: double (nullable = true)
 |-- toddler_center_based_price_2023_estimated_dollar: double (nullable = true)
 |-- toddler_center_based_price_as_share_of_family_income_2018_percent: double (nullable = true)
 |-- toddler_home_based_price_2018_dollar: double (nullable = true)
 |-- toddler_home_based_price_2023_estimated_dollar: double (nullable = true)
 |-- toddler_home_based_price_as_share_of_family_income_2018_percent: double (nullable = true)
 |-- preschool_center_based_price_2018_dollar: double (nullable = true)
 |-- preschool_center_based_price_2023_estimated_dollar: double (nullable = true)
 |-- preschool_center_based_price_as_share_of_family_income_2018_percent: double (nullable = true)
 |-- preschool_home_based_price_2018_dollar: double (nullable = true)
 |-- preschool_home_based_price_2023_estimated_dollar: double (nullable = true)
 |-- preschool_home_based_price_as_share_of_family_income_2018_percent: double (nullable = true)
 |-- school_age_center_based_price_2018_dollar: double (nullable = true)
 |-- school_age_center_based_price_2023_estimated_dollar: double (nullable = true)
 |-- school_age_center_based_price_as_share_of_family_income_2018_percent: double (nullable = true)
 |-- school_age_home_based_price_2018_dollar: double (nullable = true)
 |-- school_age_home_based_price_2023_estimated_dollar: double (nullable = true)
 |-- school_age_home_based_price_as_share_of_family_income_2018_percent: double (nullable = true)
 |-- women_s_labor_force_participation_rate_percent: double (nullable = true)
 |-- women_s_median_earnings_dollar: double (nullable = true)
 |-- median_family_income_dollar: double (nullable = true)
 |-- percent_of_families_in_poverty_percent: double (nullable = true)
 |-- total_population: long (nullable = true)
 |-- percent_asian_percent: double (nullable = true)
 |-- percent_black_percent: double (nullable = true)
 |-- percent_hispanic_of_any_race_percent: double (nullable = true)
 |-- percent_white_percent: double (nullable = true)
 |-- reference_sheetname: string (nullable = true)

In [57]:
# Display the DataFrame
display(cleaned_dataframe)
state_name county_name county_fips_code infant_center_based_price_2018_dollar infant_center_based_price_2023_estimated_dollar infant_center_based_price_as_share_of_family_income_2018_percent infant_home_based_price_2018_dollar infant_home_based_price_2023_estimated_dollar infant_home_based_price_as_share_of_family_income_2018_percent toddler_center_based_price_2018_dollar ... women_s_labor_force_participation_rate_percent women_s_median_earnings_dollar median_family_income_dollar percent_of_families_in_poverty_percent total_population percent_asian_percent percent_black_percent percent_hispanic_of_any_race_percent percent_white_percent reference_sheetname
0 Alabama Autauga County 1001 6261.32 7369.900177 8.515327 5569.20 6555.238842 7.574051 6261.32 ... 66.2 26243.0 73530.0 12.0 55200 1.0 19.1 2.8 76.9 AllStates
1 Alabama Baldwin County 1003 6261.32 7369.900177 8.702200 5613.92 6607.876614 7.802421 6261.32 ... 70.1 25308.0 71951.0 7.3 208107 0.8 9.5 4.5 86.3 AllStates
2 Alabama Barbour County 1005 4482.40 5276.018564 10.109385 4247.36 4999.364226 9.579287 4482.40 ... 60.8 20396.0 44339.0 21.5 25782 0.4 47.6 4.3 47.4 AllStates
3 Alabama Bibb County 1007 5451.16 6416.299606 9.940117 4529.72 5331.716671 8.259883 5451.16 ... 59.5 22416.0 54840.0 10.5 22527 0.2 22.3 2.4 76.7 AllStates
4 Alabama Blount County 1009 9260.16 10899.691251 15.464013 5660.72 6662.962655 9.453124 9260.16 ... 56.3 27579.0 59882.0 10.2 57645 0.3 1.5 9.1 95.5 AllStates
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6279 Wyoming Sweetwater County 56037 7235.80 8516.913957 8.524540 6885.84 8104.992786 8.112250 6524.44 ... 71.2 22290.0 84882.0 10.0 44117 0.6 0.8 16.0 93.1 Wyoming
6280 Wyoming Teton County 56039 20751.12 24425.150440 18.763672 22408.88 26376.420415 20.262659 19022.64 ... 81.2 34836.0 110592.0 2.4 23059 1.2 1.2 14.9 90.3 Wyoming
6281 Wyoming Uinta County 56041 5625.36 6621.342091 7.962631 5379.40 6331.834343 7.614478 5071.04 ... 66.2 19981.0 70647.0 11.0 20609 0.1 0.1 9.1 93.4 Wyoming
6282 Wyoming Washakie County 56043 5760.04 6779.867474 8.643388 5504.20 6478.730452 8.259480 5196.88 ... 70.4 22195.0 66641.0 8.0 8129 0.0 0.0 14.2 89.7 Wyoming
6283 Wyoming Weston County 56045 6561.36 7723.062904 8.695495 6266.00 7375.408781 8.304067 5924.36 ... 67.7 24406.0 75457.0 12.6 7100 4.4 0.5 1.5 92.5 Wyoming

6284 rows × 37 columns

In [58]:
spark_df = spark_df.withColumnRenamed("women_s_labor_force_participation_rate_percent","womens_labor_force_participation_rate_percent")
spark_df = spark_df.withColumnRenamed("women_s_median_earnings_dollar", "womens_median_earnings_dollar")
display(spark_df)
DataFrame[state_name: string, county_name: string, county_fips_code: bigint, infant_center_based_price_2018_dollar: double, infant_center_based_price_2023_estimated_dollar: double, infant_center_based_price_as_share_of_family_income_2018_percent: double, infant_home_based_price_2018_dollar: double, infant_home_based_price_2023_estimated_dollar: double, infant_home_based_price_as_share_of_family_income_2018_percent: double, toddler_center_based_price_2018_dollar: double, toddler_center_based_price_2023_estimated_dollar: double, toddler_center_based_price_as_share_of_family_income_2018_percent: double, toddler_home_based_price_2018_dollar: double, toddler_home_based_price_2023_estimated_dollar: double, toddler_home_based_price_as_share_of_family_income_2018_percent: double, preschool_center_based_price_2018_dollar: double, preschool_center_based_price_2023_estimated_dollar: double, preschool_center_based_price_as_share_of_family_income_2018_percent: double, preschool_home_based_price_2018_dollar: double, preschool_home_based_price_2023_estimated_dollar: double, preschool_home_based_price_as_share_of_family_income_2018_percent: double, school_age_center_based_price_2018_dollar: double, school_age_center_based_price_2023_estimated_dollar: double, school_age_center_based_price_as_share_of_family_income_2018_percent: double, school_age_home_based_price_2018_dollar: double, school_age_home_based_price_2023_estimated_dollar: double, school_age_home_based_price_as_share_of_family_income_2018_percent: double, womens_labor_force_participation_rate_percent: double, womens_median_earnings_dollar: double, median_family_income_dollar: double, percent_of_families_in_poverty_percent: double, total_population: bigint, percent_asian_percent: double, percent_black_percent: double, percent_hispanic_of_any_race_percent: double, percent_white_percent: double, reference_sheetname: string]

Storing data into Cassandra¶

To store the dataset efficiently in Cassandra, a Cassandra environment was initialized using Docker Compose, and the database was accessed with cqlsh.

  • A keyspace named childcare was created with the SimpleStrategy replication strategy and a replication factor of 1, ensuring single-node data replication for simplicity during testing. Durable writes were enabled to guarantee data persistence.
  • Within the childcare keyspace, a table named childcare_prices was created with a schema tailored to the dataset. The schema includes columns for state and county information; childcare pricing metrics across age groups and years (in both dollar amounts and percentages); demographic and economic data such as median family income and racial composition; and a reference column tracking the source sheet of each record.

This structure ensures that the data is well-organized and optimized for queries by geographic and demographic attributes.

image.png

image.png

In [59]:
# CASSANDRA CONFIGURATION
cassandra_host = "cassandra"

# Spark init
spark = SparkSession.builder \
    .master("local") \
    .appName("jupyter-pyspark") \
    .config("spark.cassandra.connection.host", cassandra_host) \
    .config("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.1.0") \
    .getOrCreate()
sc = spark.sparkContext
sc.setLogLevel("ERROR")
print(cassandra_host)
cassandra
In [60]:
# We need a keyspace and table before we can write; create them with the plain Python driver
!pip install -q cassandra-driver
from cassandra.cluster import Cluster
with Cluster([cassandra_host]) as cluster:
    session = cluster.connect()
    session.execute("CREATE KEYSPACE IF NOT EXISTS childcare WITH replication={ 'class': 'SimpleStrategy', 'replication_factor' : 1 };")
    session.execute("""
    CREATE TABLE IF NOT EXISTS childcare.childcare_prices (
        state_name text,
        county_name text,
        county_fips_code int,
        infant_center_based_price_2018_dollar double,
        infant_center_based_price_2023_estimated_dollar double,
        infant_center_based_price_as_share_of_family_income_2018_percent double,
        infant_home_based_price_2018_dollar double,
        infant_home_based_price_2023_estimated_dollar double,
        infant_home_based_price_as_share_of_family_income_2018_percent double,
        toddler_center_based_price_2018_dollar double,
        toddler_center_based_price_2023_estimated_dollar double,
        toddler_center_based_price_as_share_of_family_income_2018_percent double,
        toddler_home_based_price_2018_dollar double,
        toddler_home_based_price_2023_estimated_dollar double,
        toddler_home_based_price_as_share_of_family_income_2018_percent double,
        preschool_center_based_price_2018_dollar double,
        preschool_center_based_price_2023_estimated_dollar double,
        preschool_center_based_price_as_share_of_family_income_2018_percent double,
        preschool_home_based_price_2018_dollar double,
        preschool_home_based_price_2023_estimated_dollar double,
        preschool_home_based_price_as_share_of_family_income_2018_percent double,
        school_age_center_based_price_2018_dollar double,
        school_age_center_based_price_2023_estimated_dollar double,
        school_age_center_based_price_as_share_of_family_income_2018_percent double,
        school_age_home_based_price_2018_dollar double,
        school_age_home_based_price_2023_estimated_dollar double,
        school_age_home_based_price_as_share_of_family_income_2018_percent double,
        womens_labor_force_participation_rate_percent double,
        womens_median_earnings_dollar double,
        median_family_income_dollar double,
        percent_of_families_in_poverty_percent double,
        total_population int,
        percent_asian_percent double,
        percent_black_percent double,
        percent_hispanic_of_any_race_percent double,
        percent_white_percent double,
        reference_sheetname text,
        PRIMARY KEY (state_name, county_name)
    );
    """)
In [61]:
# Write back to our newly-minted Cassandra table. Append mode is safe here because Cassandra
# upserts on the primary key, so rows sharing the same (state_name, county_name) collapse into one record.
spark_df.write.format("org.apache.spark.sql.cassandra")\
  .mode("Append")\
  .option("table", "childcare_prices")\
  .option("keyspace","childcare")\
  .save()
                                                                                

Read from Cassandra client¶

image.png

In [62]:
# read back from Cassandra
df1 = spark.read.format("org.apache.spark.sql.cassandra")\
    .options(table="childcare_prices", keyspace="childcare") \
    .load()
df1.toPandas()
                                                                                
Out[62]:
state_name county_name county_fips_code infant_center_based_price_2018_dollar infant_center_based_price_2023_estimated_dollar infant_center_based_price_as_share_of_family_income_2018_percent infant_home_based_price_2018_dollar infant_home_based_price_2023_estimated_dollar infant_home_based_price_as_share_of_family_income_2018_percent median_family_income_dollar ... school_age_home_based_price_as_share_of_family_income_2018_percent toddler_center_based_price_2018_dollar toddler_center_based_price_2023_estimated_dollar toddler_center_based_price_as_share_of_family_income_2018_percent toddler_home_based_price_2018_dollar toddler_home_based_price_2023_estimated_dollar toddler_home_based_price_as_share_of_family_income_2018_percent total_population womens_labor_force_participation_rate_percent womens_median_earnings_dollar
0 Tennessee Anderson County 47001 9620.00 11323.241697 15.715103 7020.0 8262.906103 11.467778 61215.0 ... 5.521522 7676.76 9035.946874 12.540652 6383.0 7513.123883 10.427183 75775 66.9 23018.0
1 Tennessee Bedford County 47003 6605.04 7774.476542 12.264260 5070.0 5967.654408 9.413993 53856.0 ... 6.758764 5720.00 6732.738306 10.620915 5070.0 5967.654408 9.413993 47558 68.7 23171.0
2 Tennessee Benton County 47005 6605.04 7774.476542 14.095864 5070.0 5967.654408 10.819924 46858.0 ... 7.768151 5720.00 6732.738306 12.207094 5070.0 5967.654408 10.819924 16112 59.7 17419.0
3 Tennessee Bledsoe County 47007 6605.04 7774.476542 12.419691 5070.0 5967.654408 9.533301 53182.0 ... 6.844421 5720.00 6732.738306 10.755519 5070.0 5967.654408 9.533301 14602 54.8 21560.0
4 Tennessee Blount County 47009 9620.00 11323.241697 14.354353 7020.0 8262.906103 10.474798 67018.0 ... 5.043421 7676.76 9035.946874 11.454773 6383.0 7513.123883 9.524307 128443 70.0 24574.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3137 Kansas Washington County 20201 5876.00 6916.358442 9.849310 5007.6 5894.206353 8.393704 59659.0 ... 7.121138 6089.20 7167.305960 10.206675 4248.4 5000.588360 7.121138 5525 79.9 22018.0
3138 Kansas Wichita County 20203 5876.00 6916.358442 8.596928 5007.6 5894.206353 7.326408 68350.0 ... 6.215655 6089.20 7167.305960 8.908851 4248.4 5000.588360 6.215655 2143 74.2 27878.0
3139 Kansas Wilson County 20205 5876.00 6916.358442 10.140826 5007.6 5894.206353 8.642137 57944.0 ... 7.331907 6089.20 7167.305960 10.508767 4248.4 5000.588360 7.331907 8780 74.6 21404.0
3140 Kansas Woodson County 20207 5876.00 6916.358442 11.161130 5007.6 5894.206353 9.511653 52647.0 ... 8.069596 6089.20 7167.305960 11.566091 4248.4 5000.588360 8.069596 3170 73.2 19744.0
3141 Kansas Wyandotte County 20209 7883.20 9278.937520 14.822503 5746.0 6763.341662 10.804001 53184.0 ... 9.660048 7332.00 8630.146374 13.786101 5137.6 6047.223133 9.660048 164345 71.1 24739.0

3142 rows × 37 columns

In [63]:
# Select a subset of the Cassandra data for a new table destined for Elasticsearch

newdf1 = df1.select("state_name", "county_name", "infant_center_based_price_2018_dollar",
        "infant_center_based_price_as_share_of_family_income_2018_percent", "infant_home_based_price_2018_dollar",
        "infant_home_based_price_as_share_of_family_income_2018_percent", "toddler_center_based_price_2018_dollar",
        "toddler_center_based_price_as_share_of_family_income_2018_percent",
        "toddler_home_based_price_2018_dollar", "toddler_home_based_price_as_share_of_family_income_2018_percent",
        "preschool_center_based_price_2018_dollar", "preschool_center_based_price_as_share_of_family_income_2018_percent",
        "preschool_home_based_price_2018_dollar", "preschool_home_based_price_as_share_of_family_income_2018_percent",
        "school_age_center_based_price_2018_dollar", "school_age_center_based_price_as_share_of_family_income_2018_percent",
        "school_age_home_based_price_2018_dollar", "school_age_home_based_price_as_share_of_family_income_2018_percent",
        "total_population", "womens_labor_force_participation_rate_percent", "womens_median_earnings_dollar",
        "median_family_income_dollar", "percent_of_families_in_poverty_percent", "percent_asian_percent",
        "percent_black_percent", "percent_hispanic_of_any_race_percent", "percent_white_percent") \

# Display the new DataFrame
newdf1.toPandas()
                                                                                
Out[63]:
state_name county_name infant_center_based_price_2018_dollar infant_center_based_price_as_share_of_family_income_2018_percent infant_home_based_price_2018_dollar infant_home_based_price_as_share_of_family_income_2018_percent toddler_center_based_price_2018_dollar toddler_center_based_price_as_share_of_family_income_2018_percent toddler_home_based_price_2018_dollar toddler_home_based_price_as_share_of_family_income_2018_percent ... school_age_home_based_price_as_share_of_family_income_2018_percent total_population womens_labor_force_participation_rate_percent womens_median_earnings_dollar median_family_income_dollar percent_of_families_in_poverty_percent percent_asian_percent percent_black_percent percent_hispanic_of_any_race_percent percent_white_percent
0 Oklahoma Adair County 6583.72 15.911545 4958.72 11.984243 5535.40 13.377964 4815.72 11.638640 ... 8.954250 22113 58.3 23033.0 41377.0 23.2 0.7 0.3 6.6 42.3
1 Oklahoma Alfalfa County 7929.48 11.652261 5969.08 8.771480 6665.36 9.794654 5796.96 8.518552 ... 6.550146 5857 68.8 22112.0 68051.0 5.4 0.0 3.3 5.3 77.6
2 Oklahoma Atoka County 6705.40 14.399467 5053.36 10.851805 5639.92 12.111409 4907.76 10.539137 ... 8.117079 13874 61.1 23214.0 46567.0 16.0 0.6 3.9 3.5 72.8
3 Oklahoma Beaver County 7760.48 13.160717 5842.72 9.908457 6523.92 11.063680 5674.24 9.622738 ... 7.399596 5415 62.7 21181.0 58967.0 9.2 0.0 0.3 23.0 80.3
4 Oklahoma Beckham County 7985.64 12.439660 6014.32 9.368829 6714.24 10.459132 5840.64 9.098279 ... 7.000296 22621 61.6 22872.0 64195.0 7.8 0.8 3.8 14.5 83.8
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3137 Kansas Washington County 5876.00 9.849310 5007.60 8.393704 6089.20 10.206675 4248.40 7.121138 ... 7.121138 5525 79.9 22018.0 59659.0 7.5 0.0 0.3 3.4 96.5
3138 Kansas Wichita County 5876.00 8.596928 5007.60 7.326408 6089.20 8.908851 4248.40 6.215655 ... 6.215655 2143 74.2 27878.0 68350.0 3.1 0.9 0.9 29.2 90.1
3139 Kansas Wilson County 5876.00 10.140826 5007.60 8.642137 6089.20 10.508767 4248.40 7.331907 ... 7.331907 8780 74.6 21404.0 57944.0 9.3 0.0 0.6 3.0 94.5
3140 Kansas Woodson County 5876.00 11.161130 5007.60 9.511653 6089.20 11.566091 4248.40 8.069596 ... 8.069596 3170 73.2 19744.0 52647.0 13.8 0.0 1.1 0.9 93.7
3141 Kansas Wyandotte County 7883.20 14.822503 5746.00 10.804001 7332.00 13.786101 5137.60 9.660048 ... 9.660048 164345 71.1 24739.0 53184.0 16.1 4.4 22.5 28.4 59.8

3142 rows × 27 columns

And save locally¶

In [68]:
newdf1.write.format("json").mode("overwrite").save("/home/jovyan/Walker-Labs/Project/datasets/childcare.json")
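Note that Spark's JSON writer creates a directory of part files at the childcare.json path, each containing newline-delimited JSON, rather than a single file. A minimal sketch of reading one such part file back (the inline sample rows are illustrative, and pandas stands in for Spark for brevity):

```python
import io
import pandas as pd

# Each part-* file holds one JSON record per line; a tiny inline sample
# stands in for a real part file here.
part_file = io.StringIO(
    '{"state_name": "Kansas", "county_name": "Wyandotte County", "total_population": 164345}\n'
    '{"state_name": "Wyoming", "county_name": "Teton County", "total_population": 23059}\n'
)
df = pd.read_json(part_file, lines=True)
print(df.shape)  # (2, 3)
```

Spark's own `spark.read.json(path)` accepts the whole directory and reassembles the part files automatically.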
                                                                                

Dataset #2: Labor Force Statistics: Employment and Earnings¶

Problem Statement¶

The dataset contains hierarchical information on employment statistics categorized by age, sex, race, ethnicity, and various employment metrics. The data is unstructured and difficult to query efficiently due to its nested relationships and mixed levels of granularity. Furthermore, the dataset includes unnecessary rows, columns, and some missing values that must be cleaned before storage.

Data Processing¶

  • Remove Unnecessary Columns and Rows:

    • Drop rows containing irrelevant or redundant information, such as totals or footnotes.
    • Eliminate columns with no meaningful data or those outside the scope of the analysis.
  • Handle Missing Values:

    • Remove rows with missing values in key attributes such as Age, Sex, Race, Ethnicity.
    • Ensure remaining data is complete for meaningful analysis.

Categorize Data¶

  • Organize the dataset into hierarchical categories:
    • Main Categories: Summary Totals, White, Black or African American, Asian, Hispanic or Latino ethnicity.
    • Subcategories: Men, Women.
    • Subcollections: Age ranges (e.g., 16 to 19 years, 20 to 24 years).
  • Create a nested structure to group related metrics for each demographic.

Transform into JSON¶

  • Convert the cleaned and categorized dataset into a JSON format, preserving its hierarchical structure.

Data Storage¶

MongoDB is chosen as the storage solution because:

  • Support for Hierarchical Data: MongoDB’s flexible schema supports nested structures, allowing the data’s inherent hierarchy to be retained without flattening.
  • Scalable and Efficient Querying: MongoDB enables ad-hoc queries and analysis of hierarchical data with minimal overhead, making it ideal for dynamic insights.
  • Flexibility: The schema-less design ensures future modifications, such as adding new categories or metrics, are seamless.
  • Better Readability: The JSON-like structure makes it easy to visualize relationships between categories and subcategories, aiding downstream analysis.

This approach ensures that the data is both clean and well-organized, allowing for efficient storage and retrieval in MongoDB while preserving the analytical flexibility required for hierarchical datasets.
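As a concrete sketch of the hierarchical shape described above, one document per main category can nest sex subcategories and age-range subcollections. The figures below are example values for one demographic group from the dataset; the pymongo host and database/collection names are assumptions:

```python
import json

# One hierarchical document per main category; subcategories (sex) nest
# subcollections (age ranges), each holding the employment metrics.
doc = {
    "category": "Black or African American",
    "subcategories": {
        "Women": {
            "16 to 19 years": {
                "employed_full_time_workers_total": 95.0,
                "employed_part_time_workers_total": 227.0,
            },
            "20 to 24 years": {
                "employed_full_time_workers_total": 667.0,
                "employed_part_time_workers_total": 360.0,
            },
        }
    },
}

# With a running MongoDB instance (host name "mongodb" assumed), the document
# would be stored along these lines:
# from pymongo import MongoClient
# MongoClient("mongodb", 27017)["labor"]["employment"].insert_one(doc)

as_json = json.dumps(doc, indent=2)  # JSON form preserves the hierarchy as-is
```

Because MongoDB stores documents as BSON, the nested dict maps onto the collection without any flattening step.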

In [11]:
# Bureau of Labor Statistics' 2018 data on earnings by age, race, & sex

# MinIO Configuration
s3_url = "http://minio:9000"
s3_key = "minio"
s3_secret = "SU2orange!"
s3_bucket = "project"
s3_file_key = "cpsaat08.xlsx"

# Initialize MinIO client
s3 = boto3.client('s3',
                  endpoint_url=s3_url,
                  aws_access_key_id=s3_key,
                  aws_secret_access_key=s3_secret)

# Download Excel file locally from MinIO
local_file = "/tmp/cpsaat08.xlsx"
s3.download_file(s3_bucket, s3_file_key, local_file)

# Read the Excel file using pandas
# Specify the sheet name and skip rows if needed
bls_pd = pd.read_excel(local_file, sheet_name="cpsaat08", skiprows=9)
In [12]:
# Adjust Pandas display options to show all rows
pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', None)  # Expand display width to avoid line wrapping

# Display the DataFrame
display(bls_pd)

# Optionally, reset Pandas options to defaults after displaying
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
pd.reset_option('display.width')
TOTAL Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10
0 Total, 16 years and over 128572.0 115402.0 9357.0 3814.0 27189.0 3564.0 22072.0 1553.0 5052.0 1262.0
1 16 to 19 years 1478.0 1273.0 173.0 33.0 3648.0 243.0 3250.0 155.0 355.0 404.0
2 16 to 17 years 231.0 192.0 33.0 5.0 1607.0 31.0 1501.0 75.0 70.0 225.0
3 18 to 19 years 1247.0 1081.0 139.0 27.0 2041.0 212.0 1749.0 80.0 285.0 179.0
4 20 years and over 127095.0 114129.0 9184.0 3781.0 23540.0 3320.0 18822.0 1398.0 4697.0 858.0
5 20 to 24 years 9543.0 8629.0 722.0 192.0 4508.0 713.0 3632.0 163.0 856.0 192.0
6 25 years and over 117552.0 105500.0 8462.0 3589.0 19032.0 2608.0 15190.0 1235.0 3841.0 666.0
7 25 to 54 years 88993.0 80300.0 6156.0 2536.0 11321.0 2022.0 8640.0 659.0 2991.0 399.0
8 55 years and over 28559.0 25200.0 2306.0 1053.0 7711.0 585.0 6549.0 577.0 850.0 266.0
9 Men, 16 years and over 72935.0 66314.0 4774.0 1847.0 9764.0 1650.0 7587.0 527.0 2877.0 521.0
10 16 to 19 years 874.0 758.0 96.0 20.0 1613.0 125.0 1425.0 64.0 219.0 203.0
11 20 years and over 72061.0 65556.0 4678.0 1828.0 8150.0 1525.0 6162.0 463.0 2658.0 318.0
12 20 to 24 years 5261.0 4815.0 367.0 79.0 1929.0 352.0 1515.0 62.0 512.0 84.0
13 25 years and over 66800.0 60740.0 4310.0 1749.0 6222.0 1174.0 4646.0 402.0 2146.0 234.0
14 25 to 54 years 50472.0 46166.0 3125.0 1181.0 3201.0 918.0 2123.0 160.0 1655.0 113.0
15 55 years and over 16328.0 14575.0 1185.0 568.0 3020.0 255.0 2524.0 241.0 491.0 121.0
16 Women, 16 years and over 55638.0 49088.0 4583.0 1966.0 17425.0 1914.0 14485.0 1026.0 2175.0 741.0
17 16 to 19 years 604.0 515.0 76.0 13.0 2035.0 119.0 1825.0 91.0 136.0 201.0
18 20 years and over 55033.0 48574.0 4506.0 1953.0 15390.0 1795.0 12660.0 935.0 2039.0 540.0
19 20 to 24 years 4281.0 3814.0 355.0 113.0 2579.0 361.0 2117.0 101.0 344.0 108.0
20 25 years and over 50752.0 44760.0 4152.0 1840.0 12811.0 1434.0 10543.0 834.0 1695.0 432.0
21 25 to 54 years 38521.0 34135.0 3031.0 1355.0 8120.0 1104.0 6518.0 498.0 1336.0 287.0
22 55 years and over 12231.0 10625.0 1121.0 485.0 4691.0 330.0 4025.0 335.0 358.0 145.0
23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
24 White NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25 Men, 16 years and over 58015.0 52678.0 3849.0 1487.0 7687.0 1183.0 6083.0 421.0 1998.0 383.0
26 16 to 19 years 702.0 610.0 74.0 18.0 1281.0 90.0 1139.0 53.0 139.0 147.0
27 20 years and over 57312.0 52068.0 3775.0 1470.0 6406.0 1093.0 4944.0 369.0 1858.0 235.0
28 20 to 24 years 4111.0 3763.0 285.0 63.0 1450.0 251.0 1153.0 45.0 332.0 58.0
29 25 years and over 53201.0 48305.0 3489.0 1406.0 4957.0 842.0 3791.0 323.0 1526.0 177.0
30 25 to 54 years 39476.0 36068.0 2480.0 928.0 2343.0 644.0 1587.0 112.0 1140.0 75.0
31 55 years and over 13726.0 12238.0 1009.0 479.0 2613.0 198.0 2204.0 211.0 386.0 102.0
32 Women, 16 years and over 41825.0 36781.0 3541.0 1504.0 13934.0 1363.0 11740.0 831.0 1414.0 560.0
33 16 to 19 years 457.0 386.0 62.0 9.0 1620.0 92.0 1453.0 74.0 85.0 145.0
34 20 years and over 41369.0 36395.0 3480.0 1494.0 12313.0 1271.0 10286.0 756.0 1329.0 415.0
35 20 to 24 years 3187.0 2834.0 270.0 83.0 1950.0 256.0 1618.0 77.0 213.0 82.0
36 25 years and over 38182.0 33561.0 3209.0 1412.0 10363.0 1015.0 8669.0 680.0 1116.0 333.0
37 25 to 54 years 28451.0 25121.0 2311.0 1020.0 6338.0 768.0 5181.0 389.0 851.0 214.0
38 55 years and over 9731.0 8441.0 898.0 392.0 4026.0 247.0 3488.0 291.0 265.0 119.0
39 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
40 Black or African American NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
41 Men, 16 years and over 7899.0 7239.0 463.0 197.0 1119.0 299.0 759.0 61.0 591.0 85.0
42 16 to 19 years 103.0 89.0 14.0 0.0 170.0 24.0 140.0 6.0 60.0 34.0
43 20 years and over 7797.0 7150.0 450.0 197.0 948.0 274.0 619.0 55.0 531.0 51.0
44 20 to 24 years 659.0 610.0 42.0 7.0 254.0 64.0 180.0 10.0 118.0 14.0
45 25 years and over 7137.0 6540.0 407.0 190.0 695.0 211.0 439.0 45.0 412.0 37.0
46 25 to 54 years 5704.0 5252.0 312.0 139.0 487.0 181.0 276.0 30.0 345.0 24.0
47 55 years and over 1434.0 1288.0 95.0 51.0 208.0 30.0 164.0 15.0 68.0 13.0
48 Women, 16 years and over 8154.0 7261.0 617.0 275.0 1920.0 365.0 1454.0 101.0 534.0 112.0
49 16 to 19 years 95.0 82.0 10.0 3.0 227.0 14.0 205.0 8.0 39.0 34.0
50 20 years and over 8058.0 7179.0 607.0 272.0 1693.0 350.0 1250.0 93.0 495.0 79.0
51 20 to 24 years 667.0 595.0 51.0 22.0 360.0 73.0 272.0 15.0 86.0 15.0
52 25 years and over 7391.0 6584.0 557.0 250.0 1333.0 278.0 977.0 78.0 409.0 64.0
53 25 to 54 years 5861.0 5243.0 423.0 194.0 964.0 226.0 681.0 58.0 344.0 47.0
54 55 years and over 1530.0 1341.0 133.0 56.0 368.0 52.0 296.0 21.0 64.0 17.0
55 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
56 Asian NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
57 Men, 16 years and over 4661.0 4307.0 251.0 104.0 527.0 77.0 425.0 24.0 140.0 19.0
58 16 to 19 years 22.0 20.0 2.0 0.0 63.0 3.0 57.0 3.0 3.0 4.0
59 20 years and over 4639.0 4287.0 249.0 103.0 464.0 74.0 368.0 22.0 137.0 15.0
60 20 to 24 years 209.0 190.0 15.0 4.0 121.0 14.0 104.0 3.0 26.0 7.0
61 25 years and over 4430.0 4097.0 234.0 99.0 343.0 60.0 264.0 19.0 112.0 8.0
62 25 to 54 years 3575.0 3321.0 184.0 70.0 211.0 40.0 161.0 10.0 87.0 5.0
63 55 years and over 855.0 776.0 50.0 29.0 132.0 20.0 103.0 8.0 24.0 3.0
64 Women, 16 years and over 3733.0 3356.0 250.0 127.0 911.0 95.0 754.0 63.0 117.0 29.0
65 16 to 19 years 19.0 17.0 2.0 0.0 62.0 3.0 55.0 4.0 3.0 7.0
66 20 years and over 3714.0 3340.0 248.0 127.0 849.0 92.0 698.0 59.0 114.0 22.0
67 20 to 24 years 196.0 181.0 12.0 4.0 134.0 14.0 114.0 7.0 22.0 5.0
68 25 years and over 3518.0 3159.0 236.0 123.0 715.0 78.0 585.0 53.0 93.0 17.0
69 25 to 54 years 2801.0 2524.0 179.0 98.0 518.0 61.0 422.0 35.0 75.0 12.0
70 55 years and over 717.0 635.0 57.0 25.0 197.0 17.0 163.0 17.0 18.0 5.0
71 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
72 Hispanic or Latino ethnicity NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
73 Men, 16 years and over 13719.0 12502.0 916.0 301.0 1699.0 438.0 1207.0 54.0 601.0 94.0
74 16 to 19 years 230.0 201.0 25.0 4.0 315.0 29.0 278.0 8.0 57.0 47.0
75 20 years and over 13489.0 12301.0 891.0 298.0 1384.0 408.0 930.0 46.0 544.0 47.0
76 20 to 24 years 1276.0 1170.0 86.0 20.0 429.0 87.0 333.0 9.0 120.0 16.0
77 25 years and over 12213.0 11131.0 805.0 277.0 955.0 321.0 597.0 37.0 424.0 31.0
78 25 to 54 years 10306.0 9410.0 676.0 221.0 712.0 268.0 420.0 24.0 336.0 20.0
79 55 years and over 1907.0 1721.0 129.0 57.0 243.0 53.0 177.0 13.0 88.0 11.0
80 Women, 16 years and over 8694.0 7707.0 691.0 296.0 2901.0 508.0 2270.0 123.0 465.0 163.0
81 16 to 19 years 140.0 120.0 17.0 2.0 409.0 33.0 361.0 15.0 33.0 49.0
82 20 years and over 8554.0 7587.0 673.0 293.0 2491.0 475.0 1909.0 107.0 432.0 115.0
83 20 to 24 years 890.0 784.0 78.0 28.0 566.0 96.0 454.0 16.0 76.0 27.0
84 25 years and over 7664.0 6803.0 595.0 265.0 1926.0 379.0 1455.0 91.0 356.0 88.0
85 25 to 54 years 6468.0 5752.0 497.0 219.0 1512.0 299.0 1143.0 70.0 299.0 71.0
86 55 years and over 1196.0 1051.0 99.0 47.0 414.0 80.0 312.0 22.0 57.0 17.0
87 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
88 Footnotes: NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
89 (1) Employed persons are classified as full- o... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
90 (2) Includes some persons at work 35 hours or ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
91 NOTE: Estimates for the above race groups (Whi... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

PART II: Dataset Preprocessing, Transformation, and Storage¶

This part covers cleaning the data and handling missing values; we'll ensure data quality and integrity are maintained for reliable analysis.

Description: This code performs several essential data exploration and preprocessing tasks on the DataFrames.

Purpose: To gather comprehensive information about the DataFrame's structure, statistical summary, and data quality, which is crucial for effective data analysis and preprocessing.

Get Shape and Check Unique Values in the TOTAL Column¶

In [13]:
print(f"Total number of columns: {len(bls_pd.columns)}")
print(" ")
print("Column names:", bls_pd.columns)
print(" ")
Total number of columns: 11
 
Column names: Index(['TOTAL', 'Unnamed: 1', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4',
       'Unnamed: 5', 'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8', 'Unnamed: 9',
       'Unnamed: 10'],
      dtype='object')
 
In [14]:
# Check unique values in the TOTAL column
print(bls_pd['TOTAL'].unique())
print(" ")

# Group by 'TOTAL' and count rows
row_counts = bls_pd.groupby('TOTAL').size().reset_index(name='Total_Rows')

# Sort the results by 'Total_Rows' in descending order
row_counts = row_counts.sort_values(by='Total_Rows', ascending=False)

# Display the result
print(row_counts)

# Get DataFrame shape (left disabled)
#print(f"DataFrame shape: {bls_pd.shape}")
print(" ")

# Display DataFrame info
bls_pd.info()
['Total, 16 years and over' '16 to 19 years' '16 to 17 years'
 '18 to 19 years' '20 years and over' '20 to 24 years' '25 years and over'
 '25 to 54 years' '55 years and over' 'Men, 16 years and over'
 'Women, 16 years and over' nan 'White' 'Black or African American'
 'Asian' 'Hispanic or Latino ethnicity' 'Footnotes:'
 '(1) Employed persons are classified as full- or part-time workers based on their usual weekly hours at all jobs regardless of the number of hours they are at work during the reference week. Persons absent from work also are classified according to their usual status. Full time is 35 hours or more per week; part time is less than 35 hours.'
 '(2) Includes some persons at work 35 hours or more classified by their reason for usually working part time.'
 'NOTE: Estimates for the above race groups (White, Black or African American, and Asian) do not sum to totals because data are not presented for all races. Persons whose ethnicity is identified as Hispanic or Latino may be of any race. Updated population controls are introduced annually with the release of January data.']
 
                                                TOTAL  Total_Rows
9                                   55 years and over          11
7                                      25 to 54 years          11
3                                      16 to 19 years          11
8                                   25 years and over          11
5                                      20 to 24 years          11
6                                   20 years and over          11
14                             Men, 16 years and over           5
18                           Women, 16 years and over           5
4                                      18 to 19 years           1
1   (2) Includes some persons at work 35 hours or ...           1
10                                              Asian           1
11                          Black or African American           1
12                                         Footnotes:           1
13                       Hispanic or Latino ethnicity           1
2                                      16 to 17 years           1
15  NOTE: Estimates for the above race groups (Whi...           1
16                           Total, 16 years and over           1
17                                              White           1
0   (1) Employed persons are classified as full- o...           1
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92 entries, 0 to 91
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   TOTAL        87 non-null     object 
 1   Unnamed: 1   79 non-null     float64
 2   Unnamed: 2   79 non-null     float64
 3   Unnamed: 3   79 non-null     float64
 4   Unnamed: 4   79 non-null     float64
 5   Unnamed: 5   79 non-null     float64
 6   Unnamed: 6   79 non-null     float64
 7   Unnamed: 7   79 non-null     float64
 8   Unnamed: 8   79 non-null     float64
 9   Unnamed: 9   79 non-null     float64
 10  Unnamed: 10  79 non-null     float64
dtypes: float64(10), object(1)
memory usage: 8.0+ KB

Standardizing Column Names for Consistent and Readable Data¶

In [15]:
# Rename columns in a Pandas DataFrame

# Mapping from the unnamed Excel columns to descriptive names
column_mapping = {
    "TOTAL": "Age, Sex, Race, Ethnicity",
    "Unnamed: 1": "Employed, Full-Time Workers - Total",
    "Unnamed: 2": "Employed, Full-Time Workers at Work 35hrs or More",
    "Unnamed: 3": "Employed, Full-Time Workers at Work Less Than 35hrs",
    "Unnamed: 4": "Employed, Full-Time Workers - Not at Work",
    "Unnamed: 5": "Employed, Part-Time Workers - Total",
    "Unnamed: 6": "Employed, Part-Time for Economic Reasons",
    "Unnamed: 7": "Employed, Part-Time for Non-Economic Reasons",
    "Unnamed: 8": "Employed, Part-Time - Not At Work",
    "Unnamed: 9": "Unemployed, Looking for Full-Time Work",
    "Unnamed: 10": "Unemployed, Looking for Part-Time Work"
}

# Apply mapping and sanitize column names
sanitized_column_mapping = {key: sanitize_column(value) for key, value in column_mapping.items()}

# Empty DataFrame with the original column names, used to preview the rename
original_columns = list(column_mapping.keys())
bls_df = pd.DataFrame(columns=original_columns)

# Rename and sanitize columns
bls_df = bls_df.rename(columns=sanitized_column_mapping)

# Display the new column names
print("Sanitized Column Names:")
print(list(bls_df.columns))
Sanitized Column Names:
['age_sex_race_ethnicity', 'employed_full_time_workers_total', 'employed_full_time_workers_at_work_35hrs_or_more', 'employed_full_time_workers_at_work_less_than_35hrs', 'employed_full_time_workers_not_at_work', 'employed_part_time_workers_total', 'employed_part_time_for_economic_reasons', 'employed_part_time_for_non_economic_reasons', 'employed_part_time_not_at_work', 'unemployed_looking_for_full_time_work', 'unemployed_looking_for_part_time_work']
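The `sanitize_column` helper is defined earlier in the notebook; judging from the output above, it lowercases a name and collapses runs of non-alphanumeric characters into underscores. A minimal sketch under that assumption:

```python
import re

def sanitize_column(name: str) -> str:
    """Lowercase a column name and collapse runs of non-alphanumeric
    characters into single underscores, trimming any at the edges."""
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")

print(sanitize_column("Employed, Full-Time Workers - Total"))
# employed_full_time_workers_total
```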
In [16]:
# Rename and sanitize columns in the original DataFrame (bls_pd)
bls_pd = bls_pd.rename(columns=sanitized_column_mapping)


# Adjust Pandas display options to show all rows
pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', None)  # Expand display width to avoid line wrapping

# Display the DataFrame
#display(bls_pd)

# Optionally, reset Pandas options to defaults after displaying
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
pd.reset_option('display.width')
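The set/reset dance above can be expressed more concisely with pandas' `option_context` context manager, which scopes the display settings to a `with` block and restores the defaults automatically:

```python
import pandas as pd

df = pd.DataFrame({"value": range(3)})

# Settings apply only inside the with-block, so no reset_option calls needed.
with pd.option_context("display.max_rows", None,
                       "display.max_columns", None,
                       "display.width", None):
    print(df)

restored = pd.get_option("display.max_rows")  # back to the default (60)
```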
In [17]:
# Define the indices of the trailing footnote/note rows to remove
rows_to_remove = [87, 88, 89, 90, 91]

# Drop the rows by index
bls_pd = bls_pd.drop(rows_to_remove, axis=0)

# Reset the index (optional, if you want sequential indices)
bls_pd.reset_index(drop=True, inplace=True)
In [18]:
# Display the updated DataFrame
display(bls_pd)
age_sex_race_ethnicity employed_full_time_workers_total employed_full_time_workers_at_work_35hrs_or_more employed_full_time_workers_at_work_less_than_35hrs employed_full_time_workers_not_at_work employed_part_time_workers_total employed_part_time_for_economic_reasons employed_part_time_for_non_economic_reasons employed_part_time_not_at_work unemployed_looking_for_full_time_work unemployed_looking_for_part_time_work
0 Total, 16 years and over 128572.0 115402.0 9357.0 3814.0 27189.0 3564.0 22072.0 1553.0 5052.0 1262.0
1 16 to 19 years 1478.0 1273.0 173.0 33.0 3648.0 243.0 3250.0 155.0 355.0 404.0
2 16 to 17 years 231.0 192.0 33.0 5.0 1607.0 31.0 1501.0 75.0 70.0 225.0
3 18 to 19 years 1247.0 1081.0 139.0 27.0 2041.0 212.0 1749.0 80.0 285.0 179.0
4 20 years and over 127095.0 114129.0 9184.0 3781.0 23540.0 3320.0 18822.0 1398.0 4697.0 858.0
... ... ... ... ... ... ... ... ... ... ... ...
82 20 years and over 8554.0 7587.0 673.0 293.0 2491.0 475.0 1909.0 107.0 432.0 115.0
83 20 to 24 years 890.0 784.0 78.0 28.0 566.0 96.0 454.0 16.0 76.0 27.0
84 25 years and over 7664.0 6803.0 595.0 265.0 1926.0 379.0 1455.0 91.0 356.0 88.0
85 25 to 54 years 6468.0 5752.0 497.0 219.0 1512.0 299.0 1143.0 70.0 299.0 71.0
86 55 years and over 1196.0 1051.0 99.0 47.0 414.0 80.0 312.0 22.0 57.0 17.0

87 rows × 11 columns

In [19]:
# Remove rows where the 'age_sex_race_ethnicity' column has NaN values
bls_pd = bls_pd.dropna(subset=['age_sex_race_ethnicity'])
In [20]:
# Adjust Pandas display options to show all rows
pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', None)  # Expand display width to avoid line wrapping

# Display the DataFrame
display(bls_pd)

# Optionally, reset Pandas options to defaults after displaying
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
pd.reset_option('display.width')
age_sex_race_ethnicity employed_full_time_workers_total employed_full_time_workers_at_work_35hrs_or_more employed_full_time_workers_at_work_less_than_35hrs employed_full_time_workers_not_at_work employed_part_time_workers_total employed_part_time_for_economic_reasons employed_part_time_for_non_economic_reasons employed_part_time_not_at_work unemployed_looking_for_full_time_work unemployed_looking_for_part_time_work
0 Total, 16 years and over 128572.0 115402.0 9357.0 3814.0 27189.0 3564.0 22072.0 1553.0 5052.0 1262.0
1 16 to 19 years 1478.0 1273.0 173.0 33.0 3648.0 243.0 3250.0 155.0 355.0 404.0
2 16 to 17 years 231.0 192.0 33.0 5.0 1607.0 31.0 1501.0 75.0 70.0 225.0
3 18 to 19 years 1247.0 1081.0 139.0 27.0 2041.0 212.0 1749.0 80.0 285.0 179.0
4 20 years and over 127095.0 114129.0 9184.0 3781.0 23540.0 3320.0 18822.0 1398.0 4697.0 858.0
5 20 to 24 years 9543.0 8629.0 722.0 192.0 4508.0 713.0 3632.0 163.0 856.0 192.0
6 25 years and over 117552.0 105500.0 8462.0 3589.0 19032.0 2608.0 15190.0 1235.0 3841.0 666.0
7 25 to 54 years 88993.0 80300.0 6156.0 2536.0 11321.0 2022.0 8640.0 659.0 2991.0 399.0
8 55 years and over 28559.0 25200.0 2306.0 1053.0 7711.0 585.0 6549.0 577.0 850.0 266.0
9 Men, 16 years and over 72935.0 66314.0 4774.0 1847.0 9764.0 1650.0 7587.0 527.0 2877.0 521.0
10 16 to 19 years 874.0 758.0 96.0 20.0 1613.0 125.0 1425.0 64.0 219.0 203.0
11 20 years and over 72061.0 65556.0 4678.0 1828.0 8150.0 1525.0 6162.0 463.0 2658.0 318.0
12 20 to 24 years 5261.0 4815.0 367.0 79.0 1929.0 352.0 1515.0 62.0 512.0 84.0
13 25 years and over 66800.0 60740.0 4310.0 1749.0 6222.0 1174.0 4646.0 402.0 2146.0 234.0
14 25 to 54 years 50472.0 46166.0 3125.0 1181.0 3201.0 918.0 2123.0 160.0 1655.0 113.0
15 55 years and over 16328.0 14575.0 1185.0 568.0 3020.0 255.0 2524.0 241.0 491.0 121.0
16 Women, 16 years and over 55638.0 49088.0 4583.0 1966.0 17425.0 1914.0 14485.0 1026.0 2175.0 741.0
17 16 to 19 years 604.0 515.0 76.0 13.0 2035.0 119.0 1825.0 91.0 136.0 201.0
18 20 years and over 55033.0 48574.0 4506.0 1953.0 15390.0 1795.0 12660.0 935.0 2039.0 540.0
19 20 to 24 years 4281.0 3814.0 355.0 113.0 2579.0 361.0 2117.0 101.0 344.0 108.0
20 25 years and over 50752.0 44760.0 4152.0 1840.0 12811.0 1434.0 10543.0 834.0 1695.0 432.0
21 25 to 54 years 38521.0 34135.0 3031.0 1355.0 8120.0 1104.0 6518.0 498.0 1336.0 287.0
22 55 years and over 12231.0 10625.0 1121.0 485.0 4691.0 330.0 4025.0 335.0 358.0 145.0
24 White NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
25 Men, 16 years and over 58015.0 52678.0 3849.0 1487.0 7687.0 1183.0 6083.0 421.0 1998.0 383.0
26 16 to 19 years 702.0 610.0 74.0 18.0 1281.0 90.0 1139.0 53.0 139.0 147.0
27 20 years and over 57312.0 52068.0 3775.0 1470.0 6406.0 1093.0 4944.0 369.0 1858.0 235.0
28 20 to 24 years 4111.0 3763.0 285.0 63.0 1450.0 251.0 1153.0 45.0 332.0 58.0
29 25 years and over 53201.0 48305.0 3489.0 1406.0 4957.0 842.0 3791.0 323.0 1526.0 177.0
30 25 to 54 years 39476.0 36068.0 2480.0 928.0 2343.0 644.0 1587.0 112.0 1140.0 75.0
31 55 years and over 13726.0 12238.0 1009.0 479.0 2613.0 198.0 2204.0 211.0 386.0 102.0
32 Women, 16 years and over 41825.0 36781.0 3541.0 1504.0 13934.0 1363.0 11740.0 831.0 1414.0 560.0
33 16 to 19 years 457.0 386.0 62.0 9.0 1620.0 92.0 1453.0 74.0 85.0 145.0
34 20 years and over 41369.0 36395.0 3480.0 1494.0 12313.0 1271.0 10286.0 756.0 1329.0 415.0
35 20 to 24 years 3187.0 2834.0 270.0 83.0 1950.0 256.0 1618.0 77.0 213.0 82.0
36 25 years and over 38182.0 33561.0 3209.0 1412.0 10363.0 1015.0 8669.0 680.0 1116.0 333.0
37 25 to 54 years 28451.0 25121.0 2311.0 1020.0 6338.0 768.0 5181.0 389.0 851.0 214.0
38 55 years and over 9731.0 8441.0 898.0 392.0 4026.0 247.0 3488.0 291.0 265.0 119.0
40 Black or African American NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
41 Men, 16 years and over 7899.0 7239.0 463.0 197.0 1119.0 299.0 759.0 61.0 591.0 85.0
42 16 to 19 years 103.0 89.0 14.0 0.0 170.0 24.0 140.0 6.0 60.0 34.0
43 20 years and over 7797.0 7150.0 450.0 197.0 948.0 274.0 619.0 55.0 531.0 51.0
44 20 to 24 years 659.0 610.0 42.0 7.0 254.0 64.0 180.0 10.0 118.0 14.0
45 25 years and over 7137.0 6540.0 407.0 190.0 695.0 211.0 439.0 45.0 412.0 37.0
46 25 to 54 years 5704.0 5252.0 312.0 139.0 487.0 181.0 276.0 30.0 345.0 24.0
47 55 years and over 1434.0 1288.0 95.0 51.0 208.0 30.0 164.0 15.0 68.0 13.0
48 Women, 16 years and over 8154.0 7261.0 617.0 275.0 1920.0 365.0 1454.0 101.0 534.0 112.0
49 16 to 19 years 95.0 82.0 10.0 3.0 227.0 14.0 205.0 8.0 39.0 34.0
50 20 years and over 8058.0 7179.0 607.0 272.0 1693.0 350.0 1250.0 93.0 495.0 79.0
51 20 to 24 years 667.0 595.0 51.0 22.0 360.0 73.0 272.0 15.0 86.0 15.0
52 25 years and over 7391.0 6584.0 557.0 250.0 1333.0 278.0 977.0 78.0 409.0 64.0
53 25 to 54 years 5861.0 5243.0 423.0 194.0 964.0 226.0 681.0 58.0 344.0 47.0
54 55 years and over 1530.0 1341.0 133.0 56.0 368.0 52.0 296.0 21.0 64.0 17.0
56 Asian NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
57 Men, 16 years and over 4661.0 4307.0 251.0 104.0 527.0 77.0 425.0 24.0 140.0 19.0
58 16 to 19 years 22.0 20.0 2.0 0.0 63.0 3.0 57.0 3.0 3.0 4.0
59 20 years and over 4639.0 4287.0 249.0 103.0 464.0 74.0 368.0 22.0 137.0 15.0
60 20 to 24 years 209.0 190.0 15.0 4.0 121.0 14.0 104.0 3.0 26.0 7.0
61 25 years and over 4430.0 4097.0 234.0 99.0 343.0 60.0 264.0 19.0 112.0 8.0
62 25 to 54 years 3575.0 3321.0 184.0 70.0 211.0 40.0 161.0 10.0 87.0 5.0
63 55 years and over 855.0 776.0 50.0 29.0 132.0 20.0 103.0 8.0 24.0 3.0
64 Women, 16 years and over 3733.0 3356.0 250.0 127.0 911.0 95.0 754.0 63.0 117.0 29.0
65 16 to 19 years 19.0 17.0 2.0 0.0 62.0 3.0 55.0 4.0 3.0 7.0
66 20 years and over 3714.0 3340.0 248.0 127.0 849.0 92.0 698.0 59.0 114.0 22.0
67 20 to 24 years 196.0 181.0 12.0 4.0 134.0 14.0 114.0 7.0 22.0 5.0
68 25 years and over 3518.0 3159.0 236.0 123.0 715.0 78.0 585.0 53.0 93.0 17.0
69 25 to 54 years 2801.0 2524.0 179.0 98.0 518.0 61.0 422.0 35.0 75.0 12.0
70 55 years and over 717.0 635.0 57.0 25.0 197.0 17.0 163.0 17.0 18.0 5.0
72 Hispanic or Latino ethnicity NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
73 Men, 16 years and over 13719.0 12502.0 916.0 301.0 1699.0 438.0 1207.0 54.0 601.0 94.0
74 16 to 19 years 230.0 201.0 25.0 4.0 315.0 29.0 278.0 8.0 57.0 47.0
75 20 years and over 13489.0 12301.0 891.0 298.0 1384.0 408.0 930.0 46.0 544.0 47.0
76 20 to 24 years 1276.0 1170.0 86.0 20.0 429.0 87.0 333.0 9.0 120.0 16.0
77 25 years and over 12213.0 11131.0 805.0 277.0 955.0 321.0 597.0 37.0 424.0 31.0
78 25 to 54 years 10306.0 9410.0 676.0 221.0 712.0 268.0 420.0 24.0 336.0 20.0
79 55 years and over 1907.0 1721.0 129.0 57.0 243.0 53.0 177.0 13.0 88.0 11.0
80 Women, 16 years and over 8694.0 7707.0 691.0 296.0 2901.0 508.0 2270.0 123.0 465.0 163.0
81 16 to 19 years 140.0 120.0 17.0 2.0 409.0 33.0 361.0 15.0 33.0 49.0
82 20 years and over 8554.0 7587.0 673.0 293.0 2491.0 475.0 1909.0 107.0 432.0 115.0
83 20 to 24 years 890.0 784.0 78.0 28.0 566.0 96.0 454.0 16.0 76.0 27.0
84 25 years and over 7664.0 6803.0 595.0 265.0 1926.0 379.0 1455.0 91.0 356.0 88.0
85 25 to 54 years 6468.0 5752.0 497.0 219.0 1512.0 299.0 1143.0 70.0 299.0 71.0
86 55 years and over 1196.0 1051.0 99.0 47.0 414.0 80.0 312.0 22.0 57.0 17.0

Save to Mongo¶

In [21]:
# Export the DataFrame to CSV
output_csv_path = "bls_pd_data.csv"  # Local file path
bls_pd.to_csv(output_csv_path, index=False)

print(f"Data exported to {output_csv_path}")
Data exported to bls_pd_data.csv
In [22]:
import os

# Print the current working directory
print("File exported to:", os.getcwd())
File exported to: /home/jovyan/Walker - Labs/Project

See Part II for BLS Data to Mongo¶


Dataset #3: National Center for Education Statistics¶

Problem Statement¶

The dataset from the National Center for Education Statistics (NCES) contains state-level educational statistics. However, the dataset requires significant preprocessing to address missing values, standardize formats, and optimize storage for efficient querying and analysis. Without these preprocessing steps, the dataset risks becoming error-prone and challenging to analyze, limiting its usability for decision-making and educational policy planning.

Data Processing¶

  • Handle Missing Values:

    • Identify columns with missing values and impute them with default values such as:
      • Zero for numeric fields.
      • "Unknown" for categorical fields.
    • Ensure completeness for accurate analytics and computations.
  • Standardize Column Names:

    • Sanitize column names by:
      • Removing spaces and special characters (e.g., replacing % with _percent, $ with _dollar).
      • Converting to snake_case for consistency and ease of querying.
    • Validate column names for compliance with Cassandra's schema requirements.
  • Validate Data Consistency:

    • Group data by state and validate counts for each category (e.g., school levels, demographics).
    • Ensure data integrity across rows and columns:
      • Numeric consistency for statistical columns.
      • Uniform categorizations for demographic groups.
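The imputation rules above can be sketched in pandas. This is a minimal illustration on a made-up frame; the column names are hypothetical, not the NCES dataset's:

```python
import pandas as pd

# Illustrative frame with the kinds of gaps described above (values are made up)
df = pd.DataFrame({
    "state": ["Alabama", "Alaska", None],
    "median_income": [49900.0, None, 87600.0],
})

# Zero for numeric fields
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(0)

# "Unknown" for categorical fields (remaining NaNs are non-numeric)
df = df.fillna("Unknown")
```

Filling numeric columns first keeps the second, catch-all `fillna` from writing the string "Unknown" into numeric fields.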

Prepare Data for Querying¶

  • Clean numeric columns by:

    • Removing invalid or outlier values.
    • Converting strings to numeric types where necessary.
  • Create a flattened, structured dataset optimized for database insertion:

    • Combine relevant metadata into a single table for easier querying.
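The string-to-numeric conversion can be handled with `pd.to_numeric` in coerce mode, which turns unparseable entries into NaN so they can be dropped before insertion. The sample values here are invented for illustration:

```python
import pandas as pd

raw = pd.Series(["63600", "68,500", "n/a"])

# Strip thousands separators, then coerce; invalid entries become NaN
cleaned = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

# Drop the invalid rows before database insertion
valid = cleaned.dropna()
```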

Data Storage¶

Cassandra is selected as the preferred storage solution due to its scalability and suitability for write-heavy workloads in distributed systems. The storage will be optimized for educational datasets.

Advantages of Cassandra

  • **High Scalability:** Cassandra can handle large datasets, ideal for managing state-wise and nationwide educational statistics.

  • **Distributed Data Model:** The architecture enables quick access and querying, even for distributed analytics across multiple regions.

  • **Efficient Querying:** Schema design with partition keys like state or school_level ensures fast data retrieval.

  • **Write-Optimized:** Cassandra's write-heavy optimization makes it suitable for continuously ingesting new datasets.
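As a sketch, a table partitioned on state might be declared as below. The keyspace, table, and column names are illustrative assumptions, not the project's final schema; the DDL would be run via `session.execute(...)` with the cassandra-driver imported later in this notebook:

```sql
CREATE KEYSPACE IF NOT EXISTS education
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS education.state_income (
    state text,
    year int,
    median_income double,
    PRIMARY KEY (state, year)  -- state is the partition key; year clusters rows
);
```

Partitioning on `state` keeps all rows for one state on the same node, so per-state lookups stay fast as the dataset grows.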

In [ ]:
# MinIO Configuration
s3_url = "http://minio:9000"
s3_key = "minio"
s3_secret = "SU2orange!"
s3_bucket = "project"
s3_file_key = "tabn102.30.xlsx"

# Initialize MinIO client
s3 = boto3.client('s3',
                  endpoint_url=s3_url,
                  aws_access_key_id=s3_key,
                  aws_secret_access_key=s3_secret)

# Download Excel file locally from MinIO
local_file = "/home/jovyan/datasets/tabn102.30.xlsx"
s3.download_file(s3_bucket, s3_file_key, local_file)

# If the download fails, try this local path instead:
#local_file = "/tmp/tabn102.30.xlsx"

Note: To prevent Jupyter Notebook from throwing errors related to the Spark context:

  • Restart the kernel.
  • Re-run the necessary library installations and imports.
  • Execute the required user-defined functions in order.

For the current dataset (Dataset #3), I have commented out the initialization of Spark with MinIO so that the Spark context is configured for Cassandra. This prevents conflicts and avoids potential errors.


In [ ]:
pip install boto3
In [ ]:
pip install openpyxl
In [ ]:
pip install -q cassandra-driver  
In [ ]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import os
import subprocess
from urllib.request import Request, urlopen, urlretrieve  
from urllib.error import HTTPError

# Install if needed: pip install boto3, openpyxl
import boto3
from openpyxl import load_workbook
from pyspark.sql.functions import lit
import pandas as pd
from IPython.display import display
import re
from pyspark.sql.functions import col
import json
from cassandra.cluster import Cluster
In [ ]:
# 1. Function to sanitize a single column name
def sanitize_column(name):
    """
    Sanitizes a column name by replacing special characters and standardizing the format.
    - Replaces '%' with '_percent' and '$' with '_dollar'.
    - Converts spaces and special characters into underscores.
    - Strips leading/trailing underscores and converts to lowercase.
    """
    # Replace '%' and '$' with meaningful labels
    name = name.replace('%', '_percent')
    name = name.replace('$', '_dollar')
    # Replace special characters with underscores
    name = re.sub(r'[^\w\s]', '_', name)  # Retain alphanumeric and spaces
    # Replace spaces with underscores and collapse multiple underscores
    name = re.sub(r'\s+', '_', name)
    name = re.sub(r'_+', '_', name)
    # Remove leading/trailing underscores and convert to lowercase
    return name.strip('_').lower()

# 2. Function to apply sanitization to all columns in a DataFrame
def sanitize_dataframe_columns(dataframe):
    """
    Sanitizes all column names in a Pandas DataFrame.
    """
    sanitized_columns = [sanitize_column(col) for col in dataframe.columns]
    dataframe.columns = sanitized_columns
    return dataframe
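For instance, the helper above turns a messy BLS-style header into a query-safe snake_case name (the definition is repeated here so the snippet runs standalone; the sample header strings are invented):

```python
import re

def sanitize_column(name):
    # Same logic as the helper above, repeated so this snippet is self-contained
    name = name.replace('%', '_percent').replace('$', '_dollar')
    name = re.sub(r'[^\w\s]', '_', name)   # special characters -> underscores
    name = re.sub(r'\s+', '_', name)       # spaces -> underscores
    name = re.sub(r'_+', '_', name)        # collapse repeated underscores
    return name.strip('_').lower()

print(sanitize_column("Employed - Part-Time (%)"))  # employed_part_time_percent
print(sanitize_column("Median income ($)"))         # median_income_dollar
```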
In [10]:
'''
# MinIO Configuration
s3_url = "http://minio:9000"
s3_key = "minio"
s3_secret = "SU2orange!"
s3_bucket = "project"
s3_file_key = "tabn102.30.xlsx"

# Initialize MinIO client
s3 = boto3.client('s3',
                  endpoint_url=s3_url,
                  aws_access_key_id=s3_key,
                  aws_secret_access_key=s3_secret)

# Download Excel file locally from MinIO
'''
local_file = "/home/jovyan/datasets/tabn102.30.xlsx"
#s3.download_file(s3_bucket, s3_file_key, local_file)

# Read the Excel file using pandas
# Specify the sheet name and skip rows if needed
nces_pd = pd.read_excel(local_file, sheet_name="Digest 2022 Table 102.30", skiprows=3)

# Adjust Pandas display options to show all rows
pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', None)  # Expand display width to avoid line wrapping

# Display the DataFrame
display(nces_pd)

# Optionally, reset Pandas options to defaults after displaying
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')
pd.reset_option('display.width')
1 2 3 4 Unnamed: 4 5 Unnamed: 6 6 Unnamed: 8 7 Unnamed: 10 8 Unnamed: 12 9 Unnamed: 14 10 Unnamed: 16 11 Unnamed: 18 Unnamed: 19 Unnamed: 20 Unnamed: 21 Unnamed: 22 Unnamed: 23 Unnamed: 24 Unnamed: 25 Unnamed: 26
0 United States 63600.0 68500.0 64300.0 90.0 62300.0 50.0 63800.0 60.0 65100.0 80.0 66700.0 60.0 66800.0 60.0 69600.0 80.0 69700.0 80.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Alabama 49900.0 55700.0 51300.0 450.0 50400.0 360.0 51200.0 520.0 52200.0 460.0 53200.0 520.0 53800.0 510.0 54800.0 390.0 53900.0 470.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 Alaska 87600.0 84100.0 78200.0 1530.0 80400.0 1580.0 83900.0 1610.0 86300.0 1530.0 80900.0 1770.0 80200.0 1500.0 80000.0 1740.0 77800.0 1310.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 Arizona 58300.0 66200.0 61600.0 550.0 58300.0 400.0 58900.0 300.0 60500.0 440.0 62500.0 350.0 63900.0 480.0 65800.0 290.0 69100.0 420.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 Arkansas 44700.0 52500.0 48700.0 510.0 47700.0 490.0 48000.0 330.0 50100.0 630.0 50700.0 510.0 50800.0 470.0 51900.0 560.0 52500.0 610.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 California 75700.0 77500.0 74600.0 270.0 71900.0 270.0 73800.0 270.0 76500.0 240.0 79400.0 200.0 81200.0 210.0 85200.0 200.0 84900.0 330.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 Colorado 63800.0 77000.0 70400.0 470.0 67300.0 550.0 73100.0 620.0 74200.0 440.0 76400.0 520.0 77600.0 430.0 81700.0 510.0 82300.0 480.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 Connecticut 88300.0 88000.0 84800.0 690.0 79700.0 830.0 81600.0 540.0 82900.0 730.0 82000.0 900.0 82400.0 600.0 83500.0 870.0 83800.0 680.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 Delaware 73800.0 77300.0 73000.0 1200.0 69500.0 1130.0 70100.0 890.0 69700.0 1020.0 69500.0 1660.0 69900.0 1030.0 74400.0 1050.0 71100.0 1090.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
9 District of Columbia 65000.0 65500.0 65700.0 1640.0 75800.0 1170.0 86500.0 1730.0 85200.0 2340.0 91000.0 1780.0 91900.0 2240.0 97800.0 1610.0 90100.0 2110.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
10 Florida 58100.0 63300.0 59000.0 230.0 55300.0 250.0 56500.0 230.0 57400.0 170.0 58100.0 240.0 59800.0 250.0 62800.0 290.0 63100.0 260.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 Georgia 61400.0 69200.0 63400.0 370.0 57800.0 360.0 58600.0 260.0 60500.0 490.0 62100.0 380.0 63400.0 470.0 65700.0 280.0 66600.0 430.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
12 Hawaii 82100.0 81300.0 80800.0 1660.0 78500.0 1180.0 84000.0 1400.0 84100.0 1220.0 86000.0 1210.0 86500.0 1320.0 88100.0 1380.0 84900.0 1280.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
13 Idaho 53400.0 61300.0 57600.0 710.0 54200.0 760.0 55200.0 760.0 58500.0 660.0 57700.0 590.0 60000.0 600.0 64600.0 690.0 66500.0 1070.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
14 Illinois 68200.0 76000.0 69900.0 290.0 66000.0 330.0 68100.0 360.0 68800.0 270.0 69600.0 370.0 70200.0 330.0 73300.0 380.0 72200.0 350.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
15 Indiana 60900.0 67800.0 61200.0 430.0 55600.0 340.0 57800.0 270.0 59100.0 250.0 59900.0 440.0 60100.0 340.0 61000.0 440.0 62700.0 440.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
16 Iowa 55500.0 64400.0 60600.0 440.0 59700.0 490.0 62600.0 550.0 63500.0 480.0 64700.0 570.0 64700.0 580.0 65400.0 480.0 65600.0 470.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
17 Kansas 57700.0 66300.0 59700.0 620.0 60100.0 660.0 61600.0 610.0 62000.0 610.0 62400.0 470.0 62800.0 510.0 65800.0 450.0 64100.0 580.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
18 Kentucky 47700.0 54900.0 52000.0 400.0 49900.0 380.0 51700.0 330.0 52700.0 410.0 53500.0 420.0 54200.0 370.0 55400.0 400.0 55600.0 430.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
19 Louisiana 46400.0 53100.0 51100.0 490.0 52900.0 540.0 52300.0 470.0 51000.0 530.0 51000.0 430.0 51700.0 490.0 54100.0 380.0 52100.0 460.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
20 Maine 58900.0 60700.0 59500.0 820.0 57100.0 720.0 58900.0 590.0 59900.0 950.0 62200.0 780.0 60000.0 870.0 62400.0 1130.0 64800.0 760.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
21 Maryland 83300.0 86200.0 85700.0 500.0 85700.0 700.0 86700.0 410.0 89100.0 510.0 89300.0 480.0 89800.0 670.0 91900.0 600.0 90200.0 650.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
22 Massachusetts 78200.0 82400.0 79500.0 590.0 77300.0 310.0 80800.0 430.0 85000.0 530.0 85500.0 610.0 86100.0 640.0 91000.0 640.0 89600.0 780.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
23 Michigan 65600.0 72900.0 64000.0 380.0 56600.0 230.0 58400.0 170.0 59300.0 280.0 60700.0 300.0 61200.0 270.0 63100.0 370.0 63500.0 320.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
24 Minnesota 65400.0 76800.0 72400.0 310.0 69100.0 350.0 72600.0 470.0 74100.0 420.0 75600.0 440.0 75900.0 350.0 79100.0 530.0 77700.0 560.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
25 Mississippi 42600.0 51100.0 45800.0 520.0 45900.0 480.0 46400.0 380.0 47100.0 380.0 48100.0 570.0 48200.0 520.0 48500.0 760.0 48700.0 760.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
26 Missouri 55800.0 61900.0 58400.0 300.0 55200.0 380.0 57500.0 290.0 58400.0 260.0 59300.0 420.0 58800.0 490.0 60800.0 410.0 61800.0 410.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
27 Montana 48600.0 53900.0 54700.0 820.0 53100.0 860.0 56600.0 980.0 56500.0 750.0 59000.0 960.0 59700.0 770.0 60600.0 780.0 63200.0 870.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
28 Nebraska 55000.0 64000.0 61000.0 640.0 60300.0 680.0 62900.0 580.0 64300.0 530.0 66300.0 680.0 64300.0 700.0 67000.0 600.0 66800.0 640.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
29 Nevada 65600.0 72700.0 68400.0 750.0 63500.0 600.0 60000.0 680.0 62300.0 620.0 64100.0 640.0 63300.0 740.0 67100.0 660.0 66300.0 610.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
30 New Hampshire 76900.0 80700.0 78900.0 840.0 76000.0 870.0 80400.0 870.0 80100.0 980.0 81100.0 1140.0 80900.0 1020.0 82600.0 1280.0 88500.0 1300.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
31 New Jersey 86600.0 90000.0 85800.0 440.0 84300.0 580.0 82600.0 420.0 85900.0 480.0 88500.0 450.0 88200.0 450.0 90900.0 490.0 89300.0 660.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
32 New Mexico 51000.0 55700.0 52100.0 630.0 52400.0 560.0 51900.0 600.0 52800.0 570.0 51700.0 800.0 50900.0 590.0 55000.0 620.0 54000.0 900.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
33 New York 69700.0 70800.0 68800.0 360.0 67400.0 280.0 69600.0 210.0 71000.0 430.0 71700.0 340.0 73200.0 420.0 76400.0 290.0 74300.0 370.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
34 North Carolina 56400.0 63900.0 56600.0 270.0 54000.0 280.0 54700.0 360.0 57100.0 200.0 58300.0 340.0 58100.0 380.0 60800.0 390.0 62000.0 330.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
35 North Dakota 49100.0 56400.0 57100.0 600.0 60600.0 1190.0 69300.0 1150.0 68500.0 1050.0 68400.0 710.0 68900.0 1520.0 68400.0 1940.0 66500.0 1110.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
36 Ohio 60700.0 66800.0 60500.0 290.0 56200.0 220.0 58400.0 160.0 59100.0 190.0 59700.0 280.0 60500.0 280.0 62100.0 360.0 62300.0 280.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
37 Oklahoma 49900.0 54500.0 51500.0 480.0 52400.0 310.0 55500.0 360.0 55500.0 430.0 55300.0 390.0 56000.0 300.0 57700.0 390.0 55800.0 380.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
38 Oregon 57600.0 66700.0 59700.0 490.0 58000.0 410.0 61900.0 630.0 65000.0 590.0 66600.0 530.0 68400.0 690.0 71100.0 560.0 71600.0 640.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
39 Pennsylvania 61500.0 65400.0 61900.0 330.0 61400.0 330.0 63700.0 260.0 64200.0 250.0 65400.0 300.0 65700.0 210.0 67300.0 280.0 69000.0 310.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
40 Rhode Island 68100.0 68700.0 71600.0 1160.0 65100.0 940.0 66400.0 1680.0 68400.0 1090.0 70600.0 1250.0 69400.0 1350.0 75400.0 1000.0 74000.0 1450.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
41 South Carolina 55500.0 60500.0 54700.0 520.0 52300.0 320.0 54000.0 380.0 55900.0 410.0 55900.0 360.0 56400.0 410.0 59600.0 540.0 59300.0 520.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
42 South Dakota 47600.0 57600.0 56100.0 750.0 57200.0 920.0 60600.0 760.0 61500.0 880.0 62500.0 880.0 60700.0 950.0 63100.0 1260.0 66100.0 1370.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
43 Tennessee 52500.0 59300.0 54100.0 410.0 51600.0 310.0 54100.0 340.0 54800.0 460.0 56700.0 280.0 56500.0 320.0 59400.0 420.0 59700.0 480.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
44 Texas 57200.0 65100.0 58600.0 210.0 60500.0 260.0 63600.0 220.0 63900.0 210.0 65400.0 330.0 65400.0 180.0 67900.0 320.0 67000.0 310.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
45 Utah 62300.0 74600.0 66700.0 800.0 68200.0 470.0 71900.0 850.0 74500.0 660.0 75600.0 830.0 77100.0 490.0 80300.0 700.0 79400.0 760.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
46 Vermont 63000.0 66600.0 63500.0 1010.0 61500.0 1120.0 65200.0 920.0 65100.0 1150.0 63600.0 1400.0 65600.0 1020.0 66800.0 1130.0 72400.0 1350.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
47 Virginia 70500.0 76100.0 75400.0 460.0 75600.0 350.0 75800.0 380.0 76900.0 510.0 79100.0 370.0 78300.0 450.0 81000.0 540.0 81000.0 440.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
48 Washington 66000.0 74700.0 68500.0 540.0 69300.0 420.0 73300.0 550.0 75800.0 410.0 78500.0 390.0 79900.0 460.0 83400.0 590.0 84200.0 510.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
49 West Virginia 44000.0 48400.0 46500.0 680.0 47600.0 680.0 48100.0 550.0 49000.0 760.0 48000.0 940.0 47600.0 670.0 51800.0 730.0 51200.0 660.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
50 Wisconsin 62300.0 71400.0 65500.0 330.0 61000.0 380.0 63600.0 300.0 64100.0 380.0 65500.0 390.0 65600.0 260.0 68000.0 390.0 67100.0 370.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
51 Wyoming 57300.0 61800.0 64300.0 1280.0 66600.0 1430.0 68900.0 1100.0 67600.0 1520.0 66800.0 1180.0 66400.0 920.0 68900.0 1360.0 65200.0 1670.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
52 \1\Based on 1989 incomes collected in the 1990... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
53 \2\Based on 1999 incomes collected in the 2000... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
54 NOTE: Constant dollars adjusted by the Consume... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
55 SOURCE: U.S. Department of Commerce, Census Bu... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

PART II: Dataset Preprocessing, Transformation, and Storage¶

This part covers cleaning the data and handling missing values. We'll maintain data quality and integrity to ensure reliable analysis.

**Description:** This code performs several essential data exploration and preprocessing tasks on the DataFrames.

**Purpose:** To gather comprehensive information about the DataFrame's structure, statistical summary, and data quality, which is crucial for effective data analysis and preprocessing.
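These exploration steps map onto standard pandas calls; a minimal sketch on a made-up two-row frame:

```python
import pandas as pd

df = pd.DataFrame({"state": ["Alabama", "Alaska"],
                   "income": [49900.0, 87600.0]})

df.info()                    # structure: dtypes and non-null counts per column
summary = df.describe()      # statistical summary of numeric columns
missing = df.isna().sum()    # data quality: missing-value count per column
print(missing)
```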

Handling Missing Data¶

In [11]:
# Define the indices of rows to remove
rows_to_remove = [52, 53, 54, 55]

# Drop the rows by index
nces_pd = nces_pd.drop(rows_to_remove, axis=0)

# Reset the index (optional, if you want sequential indices)
nces_pd.reset_index(drop=True, inplace=True)

# Drop all the N/A's
nces_pd = nces_pd.dropna(axis=1)
display(nces_pd)
1 2 3 4 Unnamed: 4 5 Unnamed: 6 6 Unnamed: 8 7 Unnamed: 10 8 Unnamed: 12 9 Unnamed: 14 10 Unnamed: 16 11 Unnamed: 18
0 United States 63600.0 68500.0 64300.0 90.0 62300.0 50.0 63800.0 60.0 65100.0 80.0 66700.0 60.0 66800.0 60.0 69600.0 80.0 69700.0 80.0
1 Alabama 49900.0 55700.0 51300.0 450.0 50400.0 360.0 51200.0 520.0 52200.0 460.0 53200.0 520.0 53800.0 510.0 54800.0 390.0 53900.0 470.0
2 Alaska 87600.0 84100.0 78200.0 1530.0 80400.0 1580.0 83900.0 1610.0 86300.0 1530.0 80900.0 1770.0 80200.0 1500.0 80000.0 1740.0 77800.0 1310.0
3 Arizona 58300.0 66200.0 61600.0 550.0 58300.0 400.0 58900.0 300.0 60500.0 440.0 62500.0 350.0 63900.0 480.0 65800.0 290.0 69100.0 420.0
4 Arkansas 44700.0 52500.0 48700.0 510.0 47700.0 490.0 48000.0 330.0 50100.0 630.0 50700.0 510.0 50800.0 470.0 51900.0 560.0 52500.0 610.0
5 California 75700.0 77500.0 74600.0 270.0 71900.0 270.0 73800.0 270.0 76500.0 240.0 79400.0 200.0 81200.0 210.0 85200.0 200.0 84900.0 330.0
6 Colorado 63800.0 77000.0 70400.0 470.0 67300.0 550.0 73100.0 620.0 74200.0 440.0 76400.0 520.0 77600.0 430.0 81700.0 510.0 82300.0 480.0
7 Connecticut 88300.0 88000.0 84800.0 690.0 79700.0 830.0 81600.0 540.0 82900.0 730.0 82000.0 900.0 82400.0 600.0 83500.0 870.0 83800.0 680.0
8 Delaware 73800.0 77300.0 73000.0 1200.0 69500.0 1130.0 70100.0 890.0 69700.0 1020.0 69500.0 1660.0 69900.0 1030.0 74400.0 1050.0 71100.0 1090.0
9 District of Columbia 65000.0 65500.0 65700.0 1640.0 75800.0 1170.0 86500.0 1730.0 85200.0 2340.0 91000.0 1780.0 91900.0 2240.0 97800.0 1610.0 90100.0 2110.0
10 Florida 58100.0 63300.0 59000.0 230.0 55300.0 250.0 56500.0 230.0 57400.0 170.0 58100.0 240.0 59800.0 250.0 62800.0 290.0 63100.0 260.0
11 Georgia 61400.0 69200.0 63400.0 370.0 57800.0 360.0 58600.0 260.0 60500.0 490.0 62100.0 380.0 63400.0 470.0 65700.0 280.0 66600.0 430.0
12 Hawaii 82100.0 81300.0 80800.0 1660.0 78500.0 1180.0 84000.0 1400.0 84100.0 1220.0 86000.0 1210.0 86500.0 1320.0 88100.0 1380.0 84900.0 1280.0
13 Idaho 53400.0 61300.0 57600.0 710.0 54200.0 760.0 55200.0 760.0 58500.0 660.0 57700.0 590.0 60000.0 600.0 64600.0 690.0 66500.0 1070.0
14 Illinois 68200.0 76000.0 69900.0 290.0 66000.0 330.0 68100.0 360.0 68800.0 270.0 69600.0 370.0 70200.0 330.0 73300.0 380.0 72200.0 350.0
15 Indiana 60900.0 67800.0 61200.0 430.0 55600.0 340.0 57800.0 270.0 59100.0 250.0 59900.0 440.0 60100.0 340.0 61000.0 440.0 62700.0 440.0
16 Iowa 55500.0 64400.0 60600.0 440.0 59700.0 490.0 62600.0 550.0 63500.0 480.0 64700.0 570.0 64700.0 580.0 65400.0 480.0 65600.0 470.0
17 Kansas 57700.0 66300.0 59700.0 620.0 60100.0 660.0 61600.0 610.0 62000.0 610.0 62400.0 470.0 62800.0 510.0 65800.0 450.0 64100.0 580.0
18 Kentucky 47700.0 54900.0 52000.0 400.0 49900.0 380.0 51700.0 330.0 52700.0 410.0 53500.0 420.0 54200.0 370.0 55400.0 400.0 55600.0 430.0
19 Louisiana 46400.0 53100.0 51100.0 490.0 52900.0 540.0 52300.0 470.0 51000.0 530.0 51000.0 430.0 51700.0 490.0 54100.0 380.0 52100.0 460.0
20 Maine 58900.0 60700.0 59500.0 820.0 57100.0 720.0 58900.0 590.0 59900.0 950.0 62200.0 780.0 60000.0 870.0 62400.0 1130.0 64800.0 760.0
21 Maryland 83300.0 86200.0 85700.0 500.0 85700.0 700.0 86700.0 410.0 89100.0 510.0 89300.0 480.0 89800.0 670.0 91900.0 600.0 90200.0 650.0
22 Massachusetts 78200.0 82400.0 79500.0 590.0 77300.0 310.0 80800.0 430.0 85000.0 530.0 85500.0 610.0 86100.0 640.0 91000.0 640.0 89600.0 780.0
23 Michigan 65600.0 72900.0 64000.0 380.0 56600.0 230.0 58400.0 170.0 59300.0 280.0 60700.0 300.0 61200.0 270.0 63100.0 370.0 63500.0 320.0
24 Minnesota 65400.0 76800.0 72400.0 310.0 69100.0 350.0 72600.0 470.0 74100.0 420.0 75600.0 440.0 75900.0 350.0 79100.0 530.0 77700.0 560.0
25 Mississippi 42600.0 51100.0 45800.0 520.0 45900.0 480.0 46400.0 380.0 47100.0 380.0 48100.0 570.0 48200.0 520.0 48500.0 760.0 48700.0 760.0
26 Missouri 55800.0 61900.0 58400.0 300.0 55200.0 380.0 57500.0 290.0 58400.0 260.0 59300.0 420.0 58800.0 490.0 60800.0 410.0 61800.0 410.0
27 Montana 48600.0 53900.0 54700.0 820.0 53100.0 860.0 56600.0 980.0 56500.0 750.0 59000.0 960.0 59700.0 770.0 60600.0 780.0 63200.0 870.0
28 Nebraska 55000.0 64000.0 61000.0 640.0 60300.0 680.0 62900.0 580.0 64300.0 530.0 66300.0 680.0 64300.0 700.0 67000.0 600.0 66800.0 640.0
29 Nevada 65600.0 72700.0 68400.0 750.0 63500.0 600.0 60000.0 680.0 62300.0 620.0 64100.0 640.0 63300.0 740.0 67100.0 660.0 66300.0 610.0
30 New Hampshire 76900.0 80700.0 78900.0 840.0 76000.0 870.0 80400.0 870.0 80100.0 980.0 81100.0 1140.0 80900.0 1020.0 82600.0 1280.0 88500.0 1300.0
31 New Jersey 86600.0 90000.0 85800.0 440.0 84300.0 580.0 82600.0 420.0 85900.0 480.0 88500.0 450.0 88200.0 450.0 90900.0 490.0 89300.0 660.0
32 New Mexico 51000.0 55700.0 52100.0 630.0 52400.0 560.0 51900.0 600.0 52800.0 570.0 51700.0 800.0 50900.0 590.0 55000.0 620.0 54000.0 900.0
33 New York 69700.0 70800.0 68800.0 360.0 67400.0 280.0 69600.0 210.0 71000.0 430.0 71700.0 340.0 73200.0 420.0 76400.0 290.0 74300.0 370.0
34 North Carolina 56400.0 63900.0 56600.0 270.0 54000.0 280.0 54700.0 360.0 57100.0 200.0 58300.0 340.0 58100.0 380.0 60800.0 390.0 62000.0 330.0
35 North Dakota 49100.0 56400.0 57100.0 600.0 60600.0 1190.0 69300.0 1150.0 68500.0 1050.0 68400.0 710.0 68900.0 1520.0 68400.0 1940.0 66500.0 1110.0
36 Ohio 60700.0 66800.0 60500.0 290.0 56200.0 220.0 58400.0 160.0 59100.0 190.0 59700.0 280.0 60500.0 280.0 62100.0 360.0 62300.0 280.0
37 Oklahoma 49900.0 54500.0 51500.0 480.0 52400.0 310.0 55500.0 360.0 55500.0 430.0 55300.0 390.0 56000.0 300.0 57700.0 390.0 55800.0 380.0
38 Oregon 57600.0 66700.0 59700.0 490.0 58000.0 410.0 61900.0 630.0 65000.0 590.0 66600.0 530.0 68400.0 690.0 71100.0 560.0 71600.0 640.0
39 Pennsylvania 61500.0 65400.0 61900.0 330.0 61400.0 330.0 63700.0 260.0 64200.0 250.0 65400.0 300.0 65700.0 210.0 67300.0 280.0 69000.0 310.0
40 Rhode Island 68100.0 68700.0 71600.0 1160.0 65100.0 940.0 66400.0 1680.0 68400.0 1090.0 70600.0 1250.0 69400.0 1350.0 75400.0 1000.0 74000.0 1450.0
41 South Carolina 55500.0 60500.0 54700.0 520.0 52300.0 320.0 54000.0 380.0 55900.0 410.0 55900.0 360.0 56400.0 410.0 59600.0 540.0 59300.0 520.0
42 South Dakota 47600.0 57600.0 56100.0 750.0 57200.0 920.0 60600.0 760.0 61500.0 880.0 62500.0 880.0 60700.0 950.0 63100.0 1260.0 66100.0 1370.0
43 Tennessee 52500.0 59300.0 54100.0 410.0 51600.0 310.0 54100.0 340.0 54800.0 460.0 56700.0 280.0 56500.0 320.0 59400.0 420.0 59700.0 480.0
44 Texas 57200.0 65100.0 58600.0 210.0 60500.0 260.0 63600.0 220.0 63900.0 210.0 65400.0 330.0 65400.0 180.0 67900.0 320.0 67000.0 310.0
45 Utah 62300.0 74600.0 66700.0 800.0 68200.0 470.0 71900.0 850.0 74500.0 660.0 75600.0 830.0 77100.0 490.0 80300.0 700.0 79400.0 760.0
46 Vermont 63000.0 66600.0 63500.0 1010.0 61500.0 1120.0 65200.0 920.0 65100.0 1150.0 63600.0 1400.0 65600.0 1020.0 66800.0 1130.0 72400.0 1350.0
47 Virginia 70500.0 76100.0 75400.0 460.0 75600.0 350.0 75800.0 380.0 76900.0 510.0 79100.0 370.0 78300.0 450.0 81000.0 540.0 81000.0 440.0
48 Washington 66000.0 74700.0 68500.0 540.0 69300.0 420.0 73300.0 550.0 75800.0 410.0 78500.0 390.0 79900.0 460.0 83400.0 590.0 84200.0 510.0
49 West Virginia 44000.0 48400.0 46500.0 680.0 47600.0 680.0 48100.0 550.0 49000.0 760.0 48000.0 940.0 47600.0 670.0 51800.0 730.0 51200.0 660.0
50 Wisconsin 62300.0 71400.0 65500.0 330.0 61000.0 380.0 63600.0 300.0 64100.0 380.0 65500.0 390.0 65600.0 260.0 68000.0 390.0 67100.0 370.0
51 Wyoming 57300.0 61800.0 64300.0 1280.0 66600.0 1430.0 68900.0 1100.0 67600.0 1520.0 66800.0 1180.0 66400.0 920.0 68900.0 1360.0 65200.0 1670.0
In [10]:
print("Column names:", nces_pd.columns)
Column names: Index([            1,           '2',             3,             4,
        'Unnamed: 4',             5,  'Unnamed: 6',             6,
        'Unnamed: 8',             7, 'Unnamed: 10',             8,
       'Unnamed: 12',             9, 'Unnamed: 14',            10,
       'Unnamed: 16',            11, 'Unnamed: 18'],
      dtype='object')

Standardizing Column Names for Consistent and Readable Data¶
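The rename below relies on a `sanitize_column` helper defined earlier in the notebook; a minimal sketch of what such a helper presumably does (lowercase the label, turn spaces and punctuation into underscores):

```python
import re

def sanitize_column(name):
    """Lowercase a column label and replace spaces/punctuation with underscores."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", str(name))
    return name.strip("_").lower()

print(sanitize_column("Median Income 1990"))  # → median_income_1990
print(sanitize_column("Std Error 2005"))      # → std_error_2005
```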

In [12]:
# Rename columns in a Pandas DataFrame

# Map the raw NCES header labels (a mix of ints and 'Unnamed: N') to descriptive names
column_mapping = {    
    1: "state_name",
    "2":"Median Income 1990",
    3:"Median Income 2000",
    4:"Median Income 2005",
    "Unnamed: 4": "Std Error 2005",
    5:"Median Income 2010",
    "Unnamed: 6": "Std Error 2010",
    6:"Median Income 2015",
    "Unnamed: 8": "Std Error 2015",
    7:"Median Income 2016",
    "Unnamed: 10": "Std Error 2016",
    8:"Median Income 2017",
    "Unnamed: 12": "Std Error 2017",
    9:"Median Income 2018",
    "Unnamed: 14": "Std Error 2018",
    10:"Median Income 2019",
    "Unnamed: 16": "Std Error 2019",
    11:"Median Income 2021",
    "Unnamed: 18": "Std Error 2021"
}

# Apply mapping and sanitize column names
sanitized_column_mapping = {key: sanitize_column(value) for key, value in column_mapping.items()}

# Preview the sanitized names on an empty DataFrame before renaming the real data
original_columns = list(column_mapping.keys())
preview_df = pd.DataFrame(columns=original_columns).rename(columns=sanitized_column_mapping)

# Display the new column names
#print("Sanitized Column Names:")
#print(list(preview_df.columns))

# Rename and sanitize columns in the original DataFrame (nces_pd)
nces_pd = nces_pd.rename(columns=sanitized_column_mapping)


# Adjust Pandas display options to show all rows
pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', None)  # Expand display width to avoid line wrapping

# Display the DataFrame
display(nces_pd)
state_name median_income_1990 median_income_2000 median_income_2005 std_error_2005 median_income_2010 std_error_2010 median_income_2015 std_error_2015 median_income_2016 std_error_2016 median_income_2017 std_error_2017 median_income_2018 std_error_2018 median_income_2019 std_error_2019 median_income_2021 std_error_2021
0 United States 63600.0 68500.0 64300.0 90.0 62300.0 50.0 63800.0 60.0 65100.0 80.0 66700.0 60.0 66800.0 60.0 69600.0 80.0 69700.0 80.0
1 Alabama 49900.0 55700.0 51300.0 450.0 50400.0 360.0 51200.0 520.0 52200.0 460.0 53200.0 520.0 53800.0 510.0 54800.0 390.0 53900.0 470.0
2 Alaska 87600.0 84100.0 78200.0 1530.0 80400.0 1580.0 83900.0 1610.0 86300.0 1530.0 80900.0 1770.0 80200.0 1500.0 80000.0 1740.0 77800.0 1310.0
3 Arizona 58300.0 66200.0 61600.0 550.0 58300.0 400.0 58900.0 300.0 60500.0 440.0 62500.0 350.0 63900.0 480.0 65800.0 290.0 69100.0 420.0
4 Arkansas 44700.0 52500.0 48700.0 510.0 47700.0 490.0 48000.0 330.0 50100.0 630.0 50700.0 510.0 50800.0 470.0 51900.0 560.0 52500.0 610.0
5 California 75700.0 77500.0 74600.0 270.0 71900.0 270.0 73800.0 270.0 76500.0 240.0 79400.0 200.0 81200.0 210.0 85200.0 200.0 84900.0 330.0
6 Colorado 63800.0 77000.0 70400.0 470.0 67300.0 550.0 73100.0 620.0 74200.0 440.0 76400.0 520.0 77600.0 430.0 81700.0 510.0 82300.0 480.0
7 Connecticut 88300.0 88000.0 84800.0 690.0 79700.0 830.0 81600.0 540.0 82900.0 730.0 82000.0 900.0 82400.0 600.0 83500.0 870.0 83800.0 680.0
8 Delaware 73800.0 77300.0 73000.0 1200.0 69500.0 1130.0 70100.0 890.0 69700.0 1020.0 69500.0 1660.0 69900.0 1030.0 74400.0 1050.0 71100.0 1090.0
9 District of Columbia 65000.0 65500.0 65700.0 1640.0 75800.0 1170.0 86500.0 1730.0 85200.0 2340.0 91000.0 1780.0 91900.0 2240.0 97800.0 1610.0 90100.0 2110.0
10 Florida 58100.0 63300.0 59000.0 230.0 55300.0 250.0 56500.0 230.0 57400.0 170.0 58100.0 240.0 59800.0 250.0 62800.0 290.0 63100.0 260.0
11 Georgia 61400.0 69200.0 63400.0 370.0 57800.0 360.0 58600.0 260.0 60500.0 490.0 62100.0 380.0 63400.0 470.0 65700.0 280.0 66600.0 430.0
12 Hawaii 82100.0 81300.0 80800.0 1660.0 78500.0 1180.0 84000.0 1400.0 84100.0 1220.0 86000.0 1210.0 86500.0 1320.0 88100.0 1380.0 84900.0 1280.0
13 Idaho 53400.0 61300.0 57600.0 710.0 54200.0 760.0 55200.0 760.0 58500.0 660.0 57700.0 590.0 60000.0 600.0 64600.0 690.0 66500.0 1070.0
14 Illinois 68200.0 76000.0 69900.0 290.0 66000.0 330.0 68100.0 360.0 68800.0 270.0 69600.0 370.0 70200.0 330.0 73300.0 380.0 72200.0 350.0
15 Indiana 60900.0 67800.0 61200.0 430.0 55600.0 340.0 57800.0 270.0 59100.0 250.0 59900.0 440.0 60100.0 340.0 61000.0 440.0 62700.0 440.0
16 Iowa 55500.0 64400.0 60600.0 440.0 59700.0 490.0 62600.0 550.0 63500.0 480.0 64700.0 570.0 64700.0 580.0 65400.0 480.0 65600.0 470.0
17 Kansas 57700.0 66300.0 59700.0 620.0 60100.0 660.0 61600.0 610.0 62000.0 610.0 62400.0 470.0 62800.0 510.0 65800.0 450.0 64100.0 580.0
18 Kentucky 47700.0 54900.0 52000.0 400.0 49900.0 380.0 51700.0 330.0 52700.0 410.0 53500.0 420.0 54200.0 370.0 55400.0 400.0 55600.0 430.0
19 Louisiana 46400.0 53100.0 51100.0 490.0 52900.0 540.0 52300.0 470.0 51000.0 530.0 51000.0 430.0 51700.0 490.0 54100.0 380.0 52100.0 460.0
20 Maine 58900.0 60700.0 59500.0 820.0 57100.0 720.0 58900.0 590.0 59900.0 950.0 62200.0 780.0 60000.0 870.0 62400.0 1130.0 64800.0 760.0
21 Maryland 83300.0 86200.0 85700.0 500.0 85700.0 700.0 86700.0 410.0 89100.0 510.0 89300.0 480.0 89800.0 670.0 91900.0 600.0 90200.0 650.0
22 Massachusetts 78200.0 82400.0 79500.0 590.0 77300.0 310.0 80800.0 430.0 85000.0 530.0 85500.0 610.0 86100.0 640.0 91000.0 640.0 89600.0 780.0
23 Michigan 65600.0 72900.0 64000.0 380.0 56600.0 230.0 58400.0 170.0 59300.0 280.0 60700.0 300.0 61200.0 270.0 63100.0 370.0 63500.0 320.0
24 Minnesota 65400.0 76800.0 72400.0 310.0 69100.0 350.0 72600.0 470.0 74100.0 420.0 75600.0 440.0 75900.0 350.0 79100.0 530.0 77700.0 560.0
25 Mississippi 42600.0 51100.0 45800.0 520.0 45900.0 480.0 46400.0 380.0 47100.0 380.0 48100.0 570.0 48200.0 520.0 48500.0 760.0 48700.0 760.0
26 Missouri 55800.0 61900.0 58400.0 300.0 55200.0 380.0 57500.0 290.0 58400.0 260.0 59300.0 420.0 58800.0 490.0 60800.0 410.0 61800.0 410.0
27 Montana 48600.0 53900.0 54700.0 820.0 53100.0 860.0 56600.0 980.0 56500.0 750.0 59000.0 960.0 59700.0 770.0 60600.0 780.0 63200.0 870.0
28 Nebraska 55000.0 64000.0 61000.0 640.0 60300.0 680.0 62900.0 580.0 64300.0 530.0 66300.0 680.0 64300.0 700.0 67000.0 600.0 66800.0 640.0
29 Nevada 65600.0 72700.0 68400.0 750.0 63500.0 600.0 60000.0 680.0 62300.0 620.0 64100.0 640.0 63300.0 740.0 67100.0 660.0 66300.0 610.0
30 New Hampshire 76900.0 80700.0 78900.0 840.0 76000.0 870.0 80400.0 870.0 80100.0 980.0 81100.0 1140.0 80900.0 1020.0 82600.0 1280.0 88500.0 1300.0
31 New Jersey 86600.0 90000.0 85800.0 440.0 84300.0 580.0 82600.0 420.0 85900.0 480.0 88500.0 450.0 88200.0 450.0 90900.0 490.0 89300.0 660.0
32 New Mexico 51000.0 55700.0 52100.0 630.0 52400.0 560.0 51900.0 600.0 52800.0 570.0 51700.0 800.0 50900.0 590.0 55000.0 620.0 54000.0 900.0
33 New York 69700.0 70800.0 68800.0 360.0 67400.0 280.0 69600.0 210.0 71000.0 430.0 71700.0 340.0 73200.0 420.0 76400.0 290.0 74300.0 370.0
34 North Carolina 56400.0 63900.0 56600.0 270.0 54000.0 280.0 54700.0 360.0 57100.0 200.0 58300.0 340.0 58100.0 380.0 60800.0 390.0 62000.0 330.0
35 North Dakota 49100.0 56400.0 57100.0 600.0 60600.0 1190.0 69300.0 1150.0 68500.0 1050.0 68400.0 710.0 68900.0 1520.0 68400.0 1940.0 66500.0 1110.0
36 Ohio 60700.0 66800.0 60500.0 290.0 56200.0 220.0 58400.0 160.0 59100.0 190.0 59700.0 280.0 60500.0 280.0 62100.0 360.0 62300.0 280.0
37 Oklahoma 49900.0 54500.0 51500.0 480.0 52400.0 310.0 55500.0 360.0 55500.0 430.0 55300.0 390.0 56000.0 300.0 57700.0 390.0 55800.0 380.0
38 Oregon 57600.0 66700.0 59700.0 490.0 58000.0 410.0 61900.0 630.0 65000.0 590.0 66600.0 530.0 68400.0 690.0 71100.0 560.0 71600.0 640.0
39 Pennsylvania 61500.0 65400.0 61900.0 330.0 61400.0 330.0 63700.0 260.0 64200.0 250.0 65400.0 300.0 65700.0 210.0 67300.0 280.0 69000.0 310.0
40 Rhode Island 68100.0 68700.0 71600.0 1160.0 65100.0 940.0 66400.0 1680.0 68400.0 1090.0 70600.0 1250.0 69400.0 1350.0 75400.0 1000.0 74000.0 1450.0
41 South Carolina 55500.0 60500.0 54700.0 520.0 52300.0 320.0 54000.0 380.0 55900.0 410.0 55900.0 360.0 56400.0 410.0 59600.0 540.0 59300.0 520.0
42 South Dakota 47600.0 57600.0 56100.0 750.0 57200.0 920.0 60600.0 760.0 61500.0 880.0 62500.0 880.0 60700.0 950.0 63100.0 1260.0 66100.0 1370.0
43 Tennessee 52500.0 59300.0 54100.0 410.0 51600.0 310.0 54100.0 340.0 54800.0 460.0 56700.0 280.0 56500.0 320.0 59400.0 420.0 59700.0 480.0
44 Texas 57200.0 65100.0 58600.0 210.0 60500.0 260.0 63600.0 220.0 63900.0 210.0 65400.0 330.0 65400.0 180.0 67900.0 320.0 67000.0 310.0
45 Utah 62300.0 74600.0 66700.0 800.0 68200.0 470.0 71900.0 850.0 74500.0 660.0 75600.0 830.0 77100.0 490.0 80300.0 700.0 79400.0 760.0
46 Vermont 63000.0 66600.0 63500.0 1010.0 61500.0 1120.0 65200.0 920.0 65100.0 1150.0 63600.0 1400.0 65600.0 1020.0 66800.0 1130.0 72400.0 1350.0
47 Virginia 70500.0 76100.0 75400.0 460.0 75600.0 350.0 75800.0 380.0 76900.0 510.0 79100.0 370.0 78300.0 450.0 81000.0 540.0 81000.0 440.0
48 Washington 66000.0 74700.0 68500.0 540.0 69300.0 420.0 73300.0 550.0 75800.0 410.0 78500.0 390.0 79900.0 460.0 83400.0 590.0 84200.0 510.0
49 West Virginia 44000.0 48400.0 46500.0 680.0 47600.0 680.0 48100.0 550.0 49000.0 760.0 48000.0 940.0 47600.0 670.0 51800.0 730.0 51200.0 660.0
50 Wisconsin 62300.0 71400.0 65500.0 330.0 61000.0 380.0 63600.0 300.0 64100.0 380.0 65500.0 390.0 65600.0 260.0 68000.0 390.0 67100.0 370.0
51 Wyoming 57300.0 61800.0 64300.0 1280.0 66600.0 1430.0 68900.0 1100.0 67600.0 1520.0 66800.0 1180.0 66400.0 920.0 68900.0 1360.0 65200.0 1670.0
In [14]:
cassandra_host = "cassandra"

# Create a SparkSession with Cassandra configuration
spark = SparkSession.builder \
    .appName("WriteToCassandra") \
    .config("spark.cassandra.connection.host", cassandra_host) \
    .config("spark.jars.packages","com.datastax.spark:spark-cassandra-connector-assembly_2.12:3.1.0")\
    .config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions") \
    .getOrCreate()
In [15]:
nces_df = spark.createDataFrame(nces_pd)
nces_df.printSchema()
root
 |-- state_name: string (nullable = true)
 |-- median_income_1990: double (nullable = true)
 |-- median_income_2000: double (nullable = true)
 |-- median_income_2005: double (nullable = true)
 |-- std_error_2005: double (nullable = true)
 |-- median_income_2010: double (nullable = true)
 |-- std_error_2010: double (nullable = true)
 |-- median_income_2015: double (nullable = true)
 |-- std_error_2015: double (nullable = true)
 |-- median_income_2016: double (nullable = true)
 |-- std_error_2016: double (nullable = true)
 |-- median_income_2017: double (nullable = true)
 |-- std_error_2017: double (nullable = true)
 |-- median_income_2018: double (nullable = true)
 |-- std_error_2018: double (nullable = true)
 |-- median_income_2019: double (nullable = true)
 |-- std_error_2019: double (nullable = true)
 |-- median_income_2021: double (nullable = true)
 |-- std_error_2021: double (nullable = true)

Storing data into Cassandra¶

Automate Keyspace and Table Creation¶

This code creates a keyspace and table in Cassandra programmatically using the Python cassandra-driver, eliminating the need to create them manually via Cassandra's cqlsh command-line interface.

In [27]:
from cassandra.cluster import Cluster

# Connect to Cassandra
cluster = Cluster(["cassandra"])  # Replace "cassandra" with your host if different
session = cluster.connect()

# Create Keyspace
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS nces
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
""")

# Create Table
session.set_keyspace("nces")
session.execute("""
    CREATE TABLE IF NOT EXISTS nces_income (
        state_name TEXT PRIMARY KEY,
        median_income_1990 DOUBLE,
        median_income_2000 DOUBLE,
        median_income_2005 DOUBLE,
        std_error_2005 DOUBLE,
        median_income_2010 DOUBLE,
        std_error_2010 DOUBLE,
        median_income_2015 DOUBLE,
        std_error_2015 DOUBLE,
        median_income_2016 DOUBLE,
        std_error_2016 DOUBLE,
        median_income_2017 DOUBLE,
        std_error_2017 DOUBLE,
        median_income_2018 DOUBLE,
        std_error_2018 DOUBLE,
        median_income_2019 DOUBLE,
        std_error_2019 DOUBLE,
        median_income_2020 DOUBLE,  -- not present in the NCES extract, so values stay null
        std_error_2020 DOUBLE,      -- not present in the NCES extract, so values stay null
        median_income_2021 DOUBLE,
        std_error_2021 DOUBLE
    );
""")

print("Keyspace and Table created successfully!")

# Close the connection
cluster.shutdown()
Keyspace and Table created successfully!


In [28]:
# Write DataFrame to Cassandra
nces_df.write \
    .format("org.apache.spark.sql.cassandra") \
    .mode("append") \
    .options(table="nces_income", keyspace="nces") \
    .save()

# Read back from Cassandra
df3 = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(table="nces_income", keyspace="nces") \
    .load()
df3.printSchema()
root
 |-- state_name: string (nullable = false)
 |-- median_income_1990: double (nullable = true)
 |-- median_income_2000: double (nullable = true)
 |-- median_income_2005: double (nullable = true)
 |-- median_income_2010: double (nullable = true)
 |-- median_income_2015: double (nullable = true)
 |-- median_income_2016: double (nullable = true)
 |-- median_income_2017: double (nullable = true)
 |-- median_income_2018: double (nullable = true)
 |-- median_income_2019: double (nullable = true)
 |-- median_income_2020: double (nullable = true)
 |-- median_income_2021: double (nullable = true)
 |-- std_error_2005: double (nullable = true)
 |-- std_error_2010: double (nullable = true)
 |-- std_error_2015: double (nullable = true)
 |-- std_error_2016: double (nullable = true)
 |-- std_error_2017: double (nullable = true)
 |-- std_error_2018: double (nullable = true)
 |-- std_error_2019: double (nullable = true)
 |-- std_error_2020: double (nullable = true)
 |-- std_error_2021: double (nullable = true)

                                                                                


And save locally (note: Spark writes a directory of JSON part files at this path, not a single file).¶

In [30]:
df3.write.format("json").mode("overwrite").save("/home/jovyan/datasets/nces.json")
                                                                                

Go to Part III to see all 3 datasets saved to Elasticsearch and an example of the 2 dashboards generated.¶

In [38]:
from pyspark.sql import SparkSession

# Initialize Spark Session with Cassandra connector
spark = SparkSession.builder \
    .appName("JoinChildcareAndNcesKeyspaces") \
    .config("spark.cassandra.connection.host", "cassandra") \
    .config("spark.cassandra.connection.port", "9042") \
    .config("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.12:3.1.0") \
    .getOrCreate()

# Manually list tables for childcare and nces keyspaces
childcare_tables = ["childcare_prices"]  
nces_tables = ["nces_income"] 

# Helper function to load a table from Cassandra as a Spark DataFrame
def load_table_as_dataframe(keyspace, table):
    return spark.read \
        .format("org.apache.spark.sql.cassandra") \
        .options(keyspace=keyspace, table=table) \
        .load()

# Load and filter tables with "state_name" from childcare keyspace
childcare_dfs = {}
for table in childcare_tables:
    df = load_table_as_dataframe("childcare", table)
    if "state_name" in df.columns:
        childcare_dfs[table] = df

# Load and filter tables with "state_name" from nces keyspace
nces_dfs = {}
for table in nces_tables:
    df = load_table_as_dataframe("nces", table)
    if "state_name" in df.columns:
        nces_dfs[table] = df

# Start joining tables by "state_name"
joined_df = None

# Join all DataFrames from childcare keyspace
for table, df in childcare_dfs.items():
    if joined_df is None:
        joined_df = df
    else:
        joined_df = joined_df.join(df, on="state_name", how="inner")

# Join with DataFrames from nces keyspace
for table, df in nces_dfs.items():
    joined_df = joined_df.join(df, on="state_name", how="inner")

# Convert the final joined DataFrame to Pandas for local analysis
if joined_df is not None:
    joined_pandas_df = joined_df.toPandas()

# Stop Spark session
spark.stop()
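The join logic above reduces to a simple inner join on `state_name`; a toy pandas sketch with two states (the incomes are the real 2018 values from the NCES table, the Maine childcare price is illustrative):

```python
import pandas as pd

# Toy frames standing in for the childcare and nces Cassandra tables
childcare = pd.DataFrame({
    "state_name": ["New Hampshire", "Maine"],
    "infant_center_based_price_2018_dollar": [10860.72, 9500.00],  # Maine value is made up
})
nces = pd.DataFrame({
    "state_name": ["New Hampshire", "Maine"],
    "median_income_2018": [80900.0, 60000.0],
})

# Inner join on state_name, mirroring the Spark joins above
joined = childcare.merge(nces, on="state_name", how="inner")
print(joined)
```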
                                                                                
In [44]:
joined_pandas_df[joined_pandas_df['state_name'] == 'New Hampshire']
Out[44]:
state_name county_name county_fips_code infant_center_based_price_2018_dollar infant_center_based_price_2023_estimated_dollar infant_center_based_price_as_share_of_family_income_2018_percent infant_home_based_price_2018_dollar infant_home_based_price_2023_estimated_dollar infant_home_based_price_as_share_of_family_income_2018_percent median_family_income_dollar percent_asian_percent percent_black_percent percent_hispanic_of_any_race_percent percent_of_families_in_poverty_percent percent_white_percent preschool_center_based_price_2018_dollar preschool_center_based_price_2023_estimated_dollar preschool_center_based_price_as_share_of_family_income_2018_percent preschool_home_based_price_2018_dollar preschool_home_based_price_2023_estimated_dollar preschool_home_based_price_as_share_of_family_income_2018_percent reference_sheetname school_age_center_based_price_2018_dollar school_age_center_based_price_2023_estimated_dollar school_age_center_based_price_as_share_of_family_income_2018_percent school_age_home_based_price_2018_dollar school_age_home_based_price_2023_estimated_dollar school_age_home_based_price_as_share_of_family_income_2018_percent toddler_center_based_price_2018_dollar toddler_center_based_price_2023_estimated_dollar toddler_center_based_price_as_share_of_family_income_2018_percent toddler_home_based_price_2018_dollar toddler_home_based_price_2023_estimated_dollar toddler_home_based_price_as_share_of_family_income_2018_percent total_population womens_labor_force_participation_rate_percent womens_median_earnings_dollar median_income_1990 median_income_2000 median_income_2005 median_income_2010 median_income_2015 median_income_2016 median_income_2017 median_income_2018 median_income_2019 median_income_2020 median_income_2021 std_error_2005 std_error_2010 std_error_2015 std_error_2016 std_error_2017 std_error_2018 std_error_2019 std_error_2020 std_error_2021
999 New Hampshire Belknap County 33001 10860.72 12783.633842 13.128862 7986.16 9400.126810 9.653982 82724.0 0.8 0.7 1.6 5.5 96.5 8966.88 10554.485396 10.839514 7986.16 9400.126810 9.653982 NewHampshire 8670.48 10205.607138 10.481215 7986.16 9400.126810 9.653982 10039.64 11817.179862 12.136309 7986.16 9400.126810 9.653982 60640 77.4 30781.0 76900.0 80700.0 78900.0 76000.0 80400.0 80100.0 81100.0 80900.0 82600.0 NaN 88500.0 840.0 870.0 870.0 980.0 1140.0 1020.0 1280.0 NaN 1300.0
1000 New Hampshire Carroll County 33003 10914.80 12847.288822 15.219052 8025.68 9446.643911 11.190608 71718.0 0.7 0.5 1.4 4.8 97.2 9011.60 10607.123168 12.565325 8025.68 9446.643911 11.190608 NewHampshire 8713.64 10256.408709 12.149865 8025.68 9446.643911 11.190608 10089.04 11875.326238 14.067654 8025.68 9446.643911 11.190608 47840 74.3 26358.0 76900.0 80700.0 78900.0 76000.0 80400.0 80100.0 81100.0 80900.0 82600.0 NaN 88500.0 840.0 870.0 870.0 980.0 1140.0 1020.0 1280.0 NaN 1300.0
1001 New Hampshire Cheshire County 33005 10904.40 12835.047480 13.635613 8017.88 9437.462904 10.026110 79970.0 1.4 0.8 1.8 5.3 95.3 9002.76 10596.718027 11.257672 8017.88 9437.462904 10.026110 NewHampshire 8705.32 10246.615635 10.885732 8017.88 9437.462904 10.026110 10079.68 11864.309030 12.604327 8017.88 9437.462904 10.026110 76263 78.4 26174.0 76900.0 80700.0 78900.0 76000.0 80400.0 80100.0 81100.0 80900.0 82600.0 NaN 88500.0 840.0 870.0 870.0 980.0 1140.0 1020.0 1280.0 NaN 1300.0
1002 New Hampshire Coos County 33007 8698.04 10238.046695 14.266800 6395.48 7527.813493 10.490068 60967.0 0.6 2.9 2.9 8.4 93.9 7181.20 8452.646910 11.778831 6395.48 7527.813493 10.490068 NewHampshire 6943.56 8172.932237 11.389046 6395.48 7527.813493 10.490068 8040.24 9463.781790 13.187856 6395.48 7527.813493 10.490068 32038 78.1 25035.0 76900.0 80700.0 78900.0 76000.0 80400.0 80100.0 81100.0 80900.0 82600.0 NaN 88500.0 840.0 870.0 870.0 980.0 1140.0 1020.0 1280.0 NaN 1300.0
1003 New Hampshire Grafton County 33009 11322.48 13327.149443 14.008983 8325.20 9799.194571 10.300533 80823.0 3.7 1.2 2.3 6.4 92.5 9348.04 11003.130594 11.566064 8325.20 9799.194571 10.300533 NewHampshire 9039.16 10639.562725 11.183896 8325.20 9799.194571 10.300533 10466.04 12319.074899 12.949334 8325.20 9799.194571 10.300533 89811 76.2 26663.0 76900.0 80700.0 78900.0 76000.0 80400.0 80100.0 81100.0 80900.0 82600.0 NaN 88500.0 840.0 870.0 870.0 980.0 1140.0 1020.0 1280.0 NaN 1300.0
1004 New Hampshire Hillsborough County 33011 12736.88 14991.972006 13.286960 9365.20 11023.328809 9.769664 95860.0 4.1 2.8 6.5 5.4 89.7 10515.96 12377.833342 10.970123 9365.20 11023.328809 9.769664 NewHampshire 10168.08 11968.360440 10.607219 9365.20 11023.328809 9.769664 11773.32 13857.811635 12.281786 9365.20 11023.328809 9.769664 411087 77.8 33250.0 76900.0 80700.0 78900.0 76000.0 80400.0 80100.0 81100.0 80900.0 82600.0 NaN 88500.0 840.0 870.0 870.0 980.0 1140.0 1020.0 1280.0 NaN 1300.0
1005 New Hampshire Merrimack County 33013 11683.88 13752.536091 12.989450 8590.92 10111.960869 9.550879 89949.0 2.1 1.3 2.1 4.6 94.2 9646.52 11354.457120 10.724433 8590.92 10111.960869 9.550879 NewHampshire 9327.24 10978.647909 10.369476 8590.92 10111.960869 9.550879 10799.88 12712.021989 12.006670 8590.92 10111.960869 9.550879 149452 78.7 31400.0 76900.0 80700.0 78900.0 76000.0 80400.0 80100.0 81100.0 80900.0 82600.0 NaN 88500.0 840.0 870.0 870.0 980.0 1140.0 1020.0 1280.0 NaN 1300.0
1006 New Hampshire Rockingham County 33015 13253.76 15600.366722 12.290208 9745.32 11470.749872 9.036832 107840.0 2.0 0.8 2.9 3.0 94.8 10942.88 12880.340447 10.147329 9745.32 11470.749872 9.036832 NewHampshire 10580.96 12454.341732 9.811721 9745.32 11470.749872 9.036832 12251.72 14420.913385 11.361016 9745.32 11470.749872 9.036832 305129 81.4 35170.0 76900.0 80700.0 78900.0 76000.0 80400.0 80100.0 81100.0 80900.0 82600.0 NaN 88500.0 840.0 870.0 870.0 980.0 1140.0 1020.0 1280.0 NaN 1300.0
1007 New Hampshire Strafford County 33017 11847.68 13945.337233 13.487335 8711.56 10253.960440 9.917193 87843.0 3.6 0.9 2.3 5.4 92.8 9781.72 11513.594571 11.135458 8711.56 10253.960440 9.917193 NewHampshire 9458.28 11132.888823 10.767255 8711.56 10253.960440 9.917193 10951.72 12890.745588 12.467379 8711.56 10253.960440 9.917193 128237 76.5 27104.0 76900.0 80700.0 78900.0 76000.0 80400.0 80100.0 81100.0 80900.0 82600.0 NaN 88500.0 840.0 870.0 870.0 980.0 1140.0 1020.0 1280.0 NaN 1300.0
1008 New Hampshire Sullivan County 33019 10393.24 12233.385502 14.018587 7641.92 8994.938377 10.307557 74139.0 0.8 0.8 1.5 6.9 96.4 8581.04 10100.331593 11.574259 7641.92 8994.938377 10.307557 NewHampshire 8297.12 9766.142947 11.191303 7641.92 8994.938377 10.307557 9607.00 11307.940019 12.958092 7641.92 8994.938377 10.307557 43125 76.6 32496.0 76900.0 80700.0 78900.0 76000.0 80400.0 80100.0 81100.0 80900.0 82600.0 NaN 88500.0 840.0 870.0 870.0 980.0 1140.0 1020.0 1280.0 NaN 1300.0
In [53]:
import matplotlib.pyplot as plt
import numpy as np

# Filter the DataFrame for rows where state_name is 'New Hampshire'
new_hampshire_data = joined_pandas_df[joined_pandas_df['state_name'] == 'New Hampshire']

# Extract years and median incomes
years = np.array([1990, 2000, 2010, 2015, 2020, 2021])
median_incomes = np.array([
    new_hampshire_data['median_income_1990'].values[0],
    new_hampshire_data['median_income_2000'].values[0],
    new_hampshire_data['median_income_2010'].values[0],
    new_hampshire_data['median_income_2015'].values[0],
    new_hampshire_data['median_income_2020'].values[0],
    new_hampshire_data['median_income_2021'].values[0]
])

# Drop years with missing values (2020 is NaN in this dataset) so that
# np.interp does not propagate NaN into the interpolated trend
valid = ~np.isnan(median_incomes)

# Interpolation
interpolation_years = np.arange(2015, 2021)  # Generate years 2015 through 2020
interpolated_incomes = np.interp(interpolation_years, years[valid], median_incomes[valid])

# Plot the trends
plt.figure(figsize=(10, 6))
plt.plot(years[valid], median_incomes[valid], marker='o', label='Known Median Income', linestyle='-', color='blue')
plt.plot(interpolation_years, interpolated_incomes, linestyle='--', label='Interpolated Trend (2015-2020)', color='orange')
plt.title('Median Income Trends in New Hampshire (With Interpolation)', fontsize=16)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Median Income ($)', fontsize=12)
plt.grid(True)
plt.legend()
plt.show()
[Figure: Median Income Trends in New Hampshire (With Interpolation)]

Observations¶

  • 1990 to 2000: A steady increase in median income (from $76,900 to $80,700), indicating economic growth and improved earning capacity in New Hampshire during this decade.
  • 2000 to 2010: Median income drops to $76,000, aligning with broader economic shocks such as the Great Recession (2008); households in New Hampshire were clearly affected by national economic challenges.
  • 2010 to 2015: Median income recovers to $80,400, roughly regaining its 2000 level and suggesting a gradual rebound from the recession.
  • 2015 to 2020: Incomes continue to climb; the 2020 value is missing in the source data, but interpolating between the 2015 and 2021 values puts it near $87,000, reflecting an improving economy, increased job opportunities, and better wages during this period.
  • 2020 to 2021: Income grows only modestly, reaching $88,500 in 2021, possibly reflecting the economic drag of the COVID-19 pandemic on continued growth.

Observed Growth Pattern (2015 to 2020): The interpolated median income values between 2015 and 2020 indicate a steady increase in income levels during this period. In 2015, the median income starts at the known value of approximately $80,400, and by 2020 it rises to around $87,000, reflecting roughly 8% growth over five years.

Annual Growth Rate: Using the interpolated data, the estimated annual growth rate for median income in New Hampshire from 2015 to 2020 is approximately 1.6% per year.
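The growth-rate arithmetic can be re-checked directly. Assuming a linear interpolation between the known 2015 and 2021 values (2020 is missing in the source table), the implied 2020 income and compound annual growth rate are:

```python
import numpy as np

# Known NCES median incomes for New Hampshire (the 2020 value is NaN in the source table)
years = np.array([2015.0, 2021.0])
incomes = np.array([80400.0, 88500.0])

# Linearly interpolate the missing 2020 value
income_2020 = np.interp(2020.0, years, incomes)

# Compound annual growth rate over 2015-2020
cagr = (income_2020 / incomes[0]) ** (1 / 5) - 1
print(f"Interpolated 2020 income: ${income_2020:,.0f}")  # ≈ $87,150
print(f"Estimated annual growth: {cagr:.1%}")            # ≈ 1.6%
```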

In [51]:
import matplotlib.pyplot as plt

# Extract relevant columns for New Hampshire
new_hampshire_data = joined_pandas_df[joined_pandas_df['state_name'] == 'New Hampshire']

# Extract data for costs and median family income
infant_center_cost = new_hampshire_data['infant_center_based_price_2018_dollar'].values[0]
infant_home_cost = new_hampshire_data['infant_home_based_price_2018_dollar'].values[0]
median_family_income = new_hampshire_data['median_family_income_dollar'].values[0]

# Calculate childcare costs as a percentage of median family income
center_cost_share = (infant_center_cost / median_family_income) * 100
home_cost_share = (infant_home_cost / median_family_income) * 100

# Plot comparison
plt.figure(figsize=(8, 5))
plt.bar(['Center-Based Care', 'Home-Based Care'], [center_cost_share, home_cost_share], color=['blue', 'green'])

# Add labels and titles
plt.title('Infant Childcare Costs as a Percentage of Median Family Income in New Hampshire (2018)', fontsize=14)
plt.ylabel('Percentage of Median Income (%)', fontsize=12)
plt.ylim(0, (center_cost_share if center_cost_share > home_cost_share else home_cost_share) + 5)  # avoid max(), which pyspark.sql.functions may shadow
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Annotate values on bars
plt.text(0, center_cost_share + 1, f'{center_cost_share:.1f}%', ha='center', fontsize=12)
plt.text(1, home_cost_share + 1, f'{home_cost_share:.1f}%', ha='center', fontsize=12)

plt.show()
[Figure: Infant Childcare Costs as a Percentage of Median Family Income in New Hampshire (2018)]

Observations¶

The bar graph illustrates Infant Childcare Costs as a Percentage of Median Family Income in New Hampshire for the year 2018, broken down into:

  • Center-Based Care: Costs account for 13.1% of the median family income.
  • Home-Based Care: Costs account for 9.7% of the median family income.

This comparison reveals that Center-Based Care is significantly more expensive, consuming a larger share of household income compared to Home-Based Care.

Income Growth vs. Childcare Costs:

Between 2015 and 2020, median income showed consistent growth, indicating improved financial capacity for families. However, childcare costs in 2018 remain substantial, suggesting that income growth has not fully offset the high costs of childcare.

Trend Alignment:

While median incomes are on an upward trajectory, the percentage of income spent on childcare demonstrates a gap between income growth and rising childcare expenses. If childcare costs continue to rise faster than income, this trend may become unsustainable for many families.

Insights¶

Center-Based Care is notably more expensive than Home-Based Care, requiring a larger share of household income. Despite the upward trend in income levels between 2015 and 2020, childcare costs remain a significant financial burden for families in New Hampshire. To address this issue, policymakers and stakeholders should consider strategies to reduce childcare costs or align them more closely with income growth to enhance affordability for families.
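The cost-share calculation behind these insights generalizes to any set of counties; a small sketch using two rows copied from the joined table above:

```python
import pandas as pd

# Two New Hampshire counties, values copied from the joined table above
df = pd.DataFrame({
    "county_name": ["Belknap County", "Rockingham County"],
    "infant_center_based_price_2018_dollar": [10860.72, 13253.76],
    "infant_home_based_price_2018_dollar": [7986.16, 9745.32],
    "median_family_income_dollar": [82724.0, 107840.0],
})

# Childcare price as a share of county median family income
df["center_share_pct"] = (df["infant_center_based_price_2018_dollar"]
                          / df["median_family_income_dollar"] * 100)
df["home_share_pct"] = (df["infant_home_based_price_2018_dollar"]
                        / df["median_family_income_dollar"] * 100)

print(df[["county_name", "center_share_pct", "home_share_pct"]].round(1))
```

The computed shares reproduce the `..._as_share_of_family_income_2018_percent` columns in the joined table (13.1% / 9.7% for Belknap, 12.3% / 9.0% for Rockingham).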

In [ ]: